03.Graphics in R
GILLESPIE
ADVANCED R GRAPHICS
NEWCASTLE UNIVERSITY
Contents
1 Background
2 ggplot2 overview
3 Plot building
5 Reshaping data
6 R setup
Bibliography
“If I can’t picture it, I can’t understand it.”
ALBERT EINSTEIN.
“The greatest value of a picture is when it forces us to
notice what we never expected to see.”
JOHN TUKEY.
1
Background
manufacturer  model   displ  year  cyl  trans       cty  hwy  class
volkswagen    passat  2.0    2008  4    auto(s6)    19   28   midsize
volkswagen    passat  2.0    2008  4    manual(m6)  21   29   midsize
volkswagen    passat  2.8    1999  6    auto(l5)    16   26   midsize
volkswagen    passat  2.8    1999  6    manual(m5)  18   26   midsize
volkswagen    passat  3.6    2008  6    auto(s6)    17   26   midsize
Table 1.1: The last five cars in the mpg dataset. The variables cty and hwy
record miles per gallon for city and highway driving respectively. The
variable displ is the engine displacement in litres.
tip($),
bill($),
There were a total of 244 tips. The first few rows of this data set are
shown in table 1.2. The data comes with the reshape2 package and is
loaded using the data function:
R> library(reshape2)
R> data(tips)
6 dr colin s. gillespie
http://imdb.com/help/show_leaf?about
including information about the data collection process. IMDB makes their raw data available
at http://uk.imdb.com/interfaces/.
http://imdb.com/help/show_leaf?infosource
Example rows are given in table 1.1. This data set contains information
on over 50,000 movies. We will use this dataset to illustrate the concepts
covered in this class.
This is the full version of the data set used in the Introduction to R course.
The dataset contains the following fields:
r1: Multiplying by ten gives the percentage (to the nearest 10%) of
users who rated this movie a 1.
for the user to iteratively add new components and for a developer to
add new functionality.
+ col=2)
+ col=4)
[Figure 2.1: scatter plot of mpg[mpg$cyl == 4, ]$cty against mpg[mpg$cyl == 4, ]$displ.]
We have to manually set the scales in the plot command using xlim
Let’s now consider the equivalent ggplot2 graphic - figure 2.2. After
loading the necessary library, the plot is generated using the following
code:
advanced r graphics 9
R> g + geom_point(aes(colour=factor(cyl)))
[Figure 2.2: As figure 2.1, but created using ggplot2. Axes: displ (x), cty (y); legend: factor(cyl).]
The ggplot function sets the default data set, and attributes called
aesthetics. The aesthetics are properties that are perceived on the
The other function, geom_point, adds a layer to the plot. The x and
y variables are inherited (in this case) from the first function, ggplot,
and the colour aesthetic is set to the cyl variable. Other possible
aesthetics are, for example, size, shape and transparency. In figure 2.2
This approach is very powerful and enables us to easily create
complex graphics. For example, we could create a plot where the size
R> p = g + geom_point(aes(size=factor(cyl)))
which gives figure 2.3 or we could create a line chart
to get figure 2.4. Of course, figures 2.3 and 2.4 aren’t particularly good
Table 2.1 summarises some standard geoms and their equivalent base
stat_* which create a layer with a specific stat (and various defaults
including a geom) or
qplot which creates a ggplot and a layer.
qplot is short for quick plot. I don’t cover qplot in this course. If you find
yourself using ggplot2 a lot, then it is worth the time investment.
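The layered construction described in this chapter can be sketched as follows. This is a minimal sketch: the definition of g is not shown in the text above, so the call ggplot(mpg, aes(x = displ, y = cty)) is an assumption based on the axis labels of figure 2.2, and the line chart call for figure 2.4 is a guess.

```r
library(ggplot2)

# Assumed default data set and aesthetics; nothing is drawn until a layer is added.
g <- ggplot(data = mpg, aes(x = displ, y = cty))

# Colour mapped to the number of cylinders (as in figure 2.2).
p_colour <- g + geom_point(aes(colour = factor(cyl)))

# Size mapped instead of colour (as in figure 2.3).
p_size <- g + geom_point(aes(size = factor(cyl)))

# A line chart variant; the exact call used for figure 2.4 is an assumption.
p_line <- g + geom_line(aes(colour = factor(cyl)))

# Printing a plot object renders it on the current graphics device.
print(p_colour)
```

Each variant reuses the same g object; only the layer that is added changes.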
3
Plot building
data and
an aesthetic mapping.
These arguments set up the defaults for the various layers that are added
to the plot and can be empty. For each plot layer, these arguments can
be overwritten. The data argument is straightforward - it is a data
frame1. The mapping argument creates default aesthetic attributes.
For example
R> g = ggplot(data=mpg,
+ mapping=aes(x=displ, y=cty, colour=factor(cyl)))
or equivalently,
R> g = ggplot(mpg, aes(displ, cty, colour=factor(cyl)))
The above commands don’t actually produce anything to be displayed;
we need to add layers for that to happen.
1 ggplot2 is very strict regarding the data argument. It doesn’t accept
matrices or vectors. The underlying philosophy is that ggplot2 takes care of
plotting, rather than massaging it into other forms. If you want to do some
data manipulation, then use other tools.
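The way layers inherit or overwrite the defaults can be sketched as below. This is an illustration rather than code from the notes: the four_cyl subset and the second layer's styling are hypothetical choices.

```r
library(ggplot2)

# Defaults shared by every layer.
g <- ggplot(mpg, aes(displ, cty, colour = factor(cyl)))

# Hypothetical subset, used only to show a layer-specific data argument.
four_cyl <- subset(mpg, cyl == 4)

# Layer 1 inherits the defaults; layer 2 overwrites the data and sets a fixed colour.
p <- g +
  geom_point() +
  geom_point(data = four_cyl, colour = "black", shape = 1, size = 4)

# ggplot() can also be called with no defaults; the layer then supplies everything.
p_empty <- ggplot() +
  geom_point(data = mpg, mapping = aes(displ, cty))
```

Keeping the defaults in ggplot() avoids repetition, while per-layer arguments handle the exceptions.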
smooth Add a smoothed conditional mean
of all the tips data, a more useful plot would be to have individual
R> g2 = g + geom_boxplot(aes(group=size))
Notice that we have included a group aesthetic to the boxplot geom.
then we would have individual lines for each size - this doesn’t make
much sense in this scenario.
When data sets are reasonably small, it is useful to display the data on
[Figure: tip (y) against size (x).]
R> g3 = g2 +
+ geom_jitter(position=position_jitter(width=0.3),
+ aes(colour=smoker))
This generates figure 3.3. Since the points would all fall on straight
lines, we use the jitter geom to wiggle the points about their axis. We
also colour the points conditional on whether someone at the table
smoked using the colour aesthetic.
[Figure 3.3: As figure 3.2, but including the data points. Axes: size (x), tip (y); legend: smoker (No, Yes).]
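The boxplot and jitter layers above can be put together in one self-contained sketch. The base object g is not defined in the surviving text, so ggplot(tips, aes(x = size, y = tip)) is an assumption based on the axis labels; setting height = 0 is also my own choice, not the notes' call.

```r
library(ggplot2)
data(tips, package = "reshape2")  # assumes the reshape2 package is installed

# Assumed base plot: tip amount against table size.
g <- ggplot(tips, aes(x = size, y = tip))

# size is numeric, so the group aesthetic is needed to get one box per table size.
g2 <- g + geom_boxplot(aes(group = size))

# Jitter horizontally only (height = 0), so the tip values themselves are not moved.
g3 <- g2 +
  geom_jitter(aes(colour = smoker),
              position = position_jitter(width = 0.3, height = 0))
```

The call in the notes leaves the jitter height at its default, which also moves the points slightly in the vertical direction.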
There are a few standard geoms that are particularly useful:
For example, to generate a bar plot in figure 3.4 of the MPAA ratings
in the movie data set, we use the following code:
+ geom_bar()
[Figure 3.4: bar plot of mpaa, with categories NC-17, PG, PG-13 and R.]
and vjust control the horizontal and vertical position. The angle
aesthetic controls the text angle.
[Figure: x (horizontal), y (vertical), legend z.]
generates figure 3.5. If the squares are unequal, then use the (slower)
geom_tile function.
3.4 Aesthetics
R> g_aes + geom_point(aes(colour = z))
which gives figure 3.6. Here the z variable has been mapped to the
to get figure 3.7. If we set the aesthetic to a constant value (figure 3.8)
ggplot(aes()), these mappings are inherited by every subsequent
layer. This is fine for x and y, but can cause trouble for other aesthetics.
For example, using the colour aesthetic is fine for geom_line, but may
[Figures: scatter plots of y against x, with legends factor(z) and "Blue".]
There are a few standard aesthetics that appear in most, but not
all, geoms and stats (see table 3.2). Individual geoms can have
additional optional and required aesthetics. See their help files for
further information.
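The difference between mapping an aesthetic and setting it to a constant can be sketched as below. The data behind g_aes is not shown in the notes, so the mpg data set is used as a stand-in; that substitution is an assumption.

```r
library(ggplot2)

g <- ggplot(mpg, aes(displ, cty))

# Mapping: the constant "Blue" is treated as data, so a scale and a legend are
# created and the points take the first default palette colour, not blue.
p_map <- g + geom_point(aes(colour = "Blue"))

# Setting: colour is a fixed parameter outside aes(), so the points really are blue.
p_set <- g + geom_point(colour = "blue")

# A layer-specific mapping is not inherited by other layers: the points are
# coloured by cylinder, but a single regression line is fitted to all the data.
p_layers <- g +
  geom_point(aes(colour = factor(cyl))) +
  geom_smooth(method = "lm", se = FALSE)
```

Putting colour inside ggplot(aes()) instead would give one regression line per cylinder group, which is the kind of inherited mapping the text warns about.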
lower
upper
middle
Typically, these statistics are used by the boxplot geom. Equally, they
could be used by the error bar geom.
A widely used stat is identity. This stat does not alter the underlying
data and is used by a number of geoms, such as geom_point and
geom_line.
step Create stair steps See geom_step
at every unique x
A simple plot to create is the mean tip amount based on table size,
figure 3.9:
In the above piece of code we calculate the mean tip size for each unique
x value, that is, for different table sizes. These x-y values are passed to
the point geom. We can use any function for fun.y provided it takes
in a vector and returns a single value. For example, we could calculate
R> g5 = g + stat_summary(geom="point",
[Figure: tip (y) against size (x).]
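A minimal sketch of the summary stat described above, assuming g is ggplot(tips, aes(x = size, y = tip)). Recent ggplot2 releases call the summary argument fun, whereas the version used in these notes calls it fun.y. The median variant is my own illustration, not the elided example from the notes.

```r
library(ggplot2)
data(tips, package = "reshape2")  # assumes the reshape2 package is installed

# Assumed base plot: tip amount against table size.
g <- ggplot(tips, aes(x = size, y = tip))

# Mean tip for each unique table size, drawn with the point geom.
g_mean <- g + stat_summary(geom = "point", fun = mean)

# Any function that maps a vector to a single value works, e.g. the median
# (chosen here for illustration only).
g_median <- g + stat_summary(geom = "point", fun = median)
```

The stat computes one y value per unique x, and the geom then decides how those values are drawn.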
As with the geom example, we can combine multiple stats:
R> g6 = g +
+ width=0.2) +
+ stat_smooth(aes(colour=smoker, lty=smoker),
+ se=FALSE, method="lm")
Using the stat_summary function, we have created error bars that
span the interquartile range. The stat_smooth function plots the
regression lines, conditional on whether someone at the table smokes -
figure 3.11.
[Figure 3.11: The IQR of the tip amount displayed using error bars. The stat_smooth function is used to add OLS regression lines, conditional on whether anyone in the party smoked. Axes: size (x), tip (y).]
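Part of the g6 command is missing above, so the following is only a sketch of how interquartile-range error bars and regression lines could be combined, not a reconstruction of the original call. It uses the argument names fun, fun.min and fun.max from recent ggplot2 releases; the median centre and the quantile helper functions are assumptions.

```r
library(ggplot2)
data(tips, package = "reshape2")  # assumes the reshape2 package is installed

# Assumed base plot: tip amount against table size.
g <- ggplot(tips, aes(x = size, y = tip))

g_iqr <- g +
  # Error bars from the lower to the upper quartile at each table size.
  stat_summary(geom = "errorbar",
               fun = median,
               fun.min = function(y) quantile(y, 0.25),
               fun.max = function(y) quantile(y, 0.75),
               width = 0.2) +
  # One OLS line per smoker group, without confidence bands.
  stat_smooth(aes(colour = smoker, linetype = smoker),
              se = FALSE, method = "lm")
```

Because colour and linetype are mapped only inside stat_smooth, the error bars are computed over all parties of a given size, smokers and non-smokers together.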
3.6 Facets
Faceting is a mechanism for automatically laying out multiple plots on
a page. The data is split into subsets, with each subset plotted onto a
different panel. ggplot2 has two types of faceting:
into 2d.
3.6.1 Facet grid
The function facet_grid lays out the plots in a 2d grid. The faceting
formula specifies the variables that appear in the columns and rows.
there are a couple of outlying films and adjusted the binwidth in the
histogram. The data is clearly bimodal. Some movies are fairly short,
whilst others have an average length of around one hundred minutes.
R> g + geom_histogram(aes(y=..density..), binwidth=3) +
+ facet_grid(Comedy ~ .)
This gives figure 3.13. Since there are many more non-comedy than
R> g + geom_density(aes(y=..density..)) +
+ facet_grid(. ~ Animation)
From figure 3.14, it’s clear that the majority of short films are animations.
For illustration purposes, we have used the geom_density
[Figures 3.13 and 3.14: density of length, faceted by Comedy and by Animation. A further grid of density against length includes an (all) margin panel.]
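The faceting formula can be sketched on a data set that ships with ggplot2. The mpg data and the variables hwy, drv and year are stand-ins for the movie data (my substitution, not the notes' example), and after_stat(density) is the current spelling of the ..density.. notation used above.

```r
library(ggplot2)

g <- ggplot(mpg, aes(x = hwy))

# Formula "rows ~ columns": one row of panels per drive type.
p_rows <- g +
  geom_histogram(aes(y = after_stat(density)), binwidth = 3) +
  facet_grid(drv ~ .)

# The same variable laid out in columns instead.
p_cols <- g +
  geom_density() +
  facet_grid(. ~ drv)

# Two faceting variables give a full grid: rows by drv, columns by year.
p_grid <- g +
  geom_histogram(binwidth = 3) +
  facet_grid(drv ~ year)
```

Plotting the density rather than the count makes panels with very different numbers of observations comparable.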
scales = ‘fixed’: x and y scales are fixed across all panels (default).
[Figure: count against length, shown in six facet panels.]
duce the y-axis scale. We can set the scale limits using the xlim and
ylim functions. However, if we use these functions, any data that
falls outside of the plotting region isn’t plotted and isn’t used in
statistical transformations, for example, when calculating the binwidth
in histograms. If you want to zoom into a plot region, then use
coord_cartesian(xlim = c(..,..)) instead.
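The difference between limiting a scale and zooming can be checked numerically. The sketch below uses the mpg data and a mean summary as a stand-in example; the limits 10 and 25 are arbitrary values chosen for illustration.

```r
library(ggplot2)

g <- ggplot(mpg, aes(x = factor(cyl), y = cty)) +
  stat_summary(geom = "point", fun = mean)

# ylim() sets scale limits: values outside are dropped before the mean is computed.
p_lim <- g + ylim(10, 25)

# coord_cartesian() only zooms: the mean still uses every observation.
p_zoom <- g + coord_cartesian(ylim = c(10, 25))

# Extract the summarised values; a warning about removed rows is expected for p_lim.
means_lim <- suppressWarnings(ggplot_build(p_lim)$data[[1]]$y)
means_zoom <- ggplot_build(p_zoom)$data[[1]]$y
```

Comparing means_lim with means_zoom shows that the two plots display different summaries even though the visible axis range is the same.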
to get figure 3.17. Notice that we have changed the alpha transparency
value to help with overplotting.
To plot the log budgets, there are two possibilities. First, we could
transform the scale
R> h2 = h + geom_point(aes(log10(budget)), alpha=0.2)
we are still using the original scale. To generate figure 3.19 we used
scale_x_log10(); this is a convenience function of the
are fundamentally different from geoms, since they don’t add a layer
to the plot.
The scale_* functions can also adjust the tick marks and labels.
For example,
then you can use the convenience functions xlim and xlab. There are
similar functions for the y-axis.
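The two ways of plotting on a log scale, and the use of scale functions to control tick marks and labels, can be sketched as follows. The movie budget data is not used here: mpg and its displ variable are stand-ins, and the break positions and axis labels are arbitrary illustrative choices.

```r
library(ggplot2)

h <- ggplot(mpg, aes(y = cty))

# Option 1: transform the data inside aes(); the axis is labelled in log10 units.
p_data <- h + geom_point(aes(x = log10(displ)), alpha = 0.2)

# Option 2: transform the scale; the points sit in the same positions, but the
# axis keeps the original units. breaks sets the tick mark positions.
p_scale <- h + geom_point(aes(x = displ), alpha = 0.2) +
  scale_x_log10(breaks = c(2, 3, 4, 5, 6, 7))

# Convenience functions for simple label changes.
p_labels <- p_scale +
  xlab("Engine displacement (litres)") +
  ylab("City miles per gallon")
```

Option 2 is usually easier to read, since the tick labels are in the units of the data.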
isn’t yet finalised. In particular, in version 0.9 the default grid lines when
using a log transformation don’t appear as a regular grid.
themes: if you want to make consistent changes to all your plots - say
reduce the font size, then you should use themes. One useful theme is
theme_bw(). This can be set globally using theme_set(theme_bw())
or using the standard notation: + theme_bw().
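Both ways of applying a theme can be sketched as below. The base_size value of 10 is an arbitrary choice to illustrate reducing the font size.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, cty)) + geom_point()

# Per plot: add the theme like any other component; base_size sets the font size.
p_bw <- p + theme_bw(base_size = 10)

# Globally: theme_set() returns the previous theme, so it can be restored later.
old_theme <- theme_set(theme_bw())
# ... plots created here use theme_bw() by default ...
theme_set(old_theme)
```

Restoring the old theme at the end keeps the change from leaking into later plots in the same session.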
4.1 The dot geom
You can think of a dot plot as a one-dimensional scatter plot, where tied
values are perturbed. There are two basic algorithms for generating a
dot plot.
1. dot density: uses a kernel density estimation algorithm to position
dots.
2. Dots can be stacked in different ways - see the stackdir argument.
3. Altering the closeness of dots - see the stackratio argument.
To create a dot plot, we use the dotplot geom:
R> ##default dotdensity method (left)
R> g = ggplot(mtcars, aes(x = mpg))
R> g + geom_dotplot(binwidth = 1.5)
R> g + geom_dotplot(method="histodot", binwidth = 1.5)
[Figure 4.1: A dot plot of mpg using geom_dotplot. This is the default dotplot, using the dotdensity method.]
[Figure 4.2: A dot plot of mpg using geom_dotplot. This plot was constructed using the histodot method.]
Dot plots are particularly useful when combined with boxplots. Using
+ geom_boxplot(aes(group=size)) +
+ geom_dotplot(aes(group=size),
+ binaxis="y", stackdir="center",
+ colour="blue", fill="blue",
+ binwidth=0.05, stackratio=0.5)
to get figure 4.3.
[Figure 4.3: Box- and dot-plots of the tips data set conditional on table size. Axes: size (x), tip (y).]
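The opening line of the command above is missing. A complete version might look like the sketch below, where the initial ggplot(tips, aes(x = size, y = tip)) call is an assumption based on the axis labels of figure 4.3.

```r
library(ggplot2)
data(tips, package = "reshape2")  # assumes the reshape2 package is installed

# Assumed starting call; the remaining arguments are those shown in the notes.
p_box_dot <- ggplot(tips, aes(x = size, y = tip)) +
  geom_boxplot(aes(group = size)) +
  # binaxis = "y" bins along the tip axis; stackdir = "center" stacks the dots
  # symmetrically around each table size.
  geom_dotplot(aes(group = size),
               binaxis = "y", stackdir = "center",
               colour = "blue", fill = "blue",
               binwidth = 0.05, stackratio = 0.5)
```

A smaller stackratio packs the dots closer together, which helps when many tips share the same value.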
The error bar geom provides a mechanism for adding error bars to your
plot. Three closely related geoms are:
The error bar geom has three required aesthetics: x, ymin and ymax.
Suppose we wanted to create a graphic where the error bars are the
mean tip size ± two standard errors. First, we calculate some
summary statistics:
R> tip_m = tapply(tips$tip, tips$size, mean)
R> tip_sd = tapply(tips$tip, tips$size, sd)
R> tip_l = tapply(tips$tip, tips$size, length)
Next we create a vector containing the standard errors multiplied by
the corresponding t statistic, i.e.
$$t_{n-1} \frac{s}{\sqrt{n}}$$
to get
you! Now that the data is in the correct form, we can apply the errorbar
geom (figure 4.4):
R> h1 = ggplot(df) +
R> h2 = ggplot(df) +
[Figures: tip against size; m against x (two panels).]
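The error bar calculation described above, the mean tip plus or minus $t_{n-1} s / \sqrt{n}$, can be sketched end to end. The code that builds df is missing from the notes, so the column names (x, m, ymin, ymax) and the 0.975 quantile are assumptions; x and m were chosen to match the axis labels in the figures.

```r
library(ggplot2)
data(tips, package = "reshape2")  # assumes the reshape2 package is installed

# Summary statistics per table size, as in the notes.
tip_m <- tapply(tips$tip, tips$size, mean)
tip_sd <- tapply(tips$tip, tips$size, sd)
tip_l <- tapply(tips$tip, tips$size, length)

# Half-width of each interval: t quantile times the standard error.
# The 0.975 quantile (a 95% interval) is an assumption.
tip_se <- qt(0.975, df = tip_l - 1) * tip_sd / sqrt(tip_l)

# Hypothetical column names, chosen to match the axis labels m and x.
df <- data.frame(x = as.numeric(names(tip_m)),
                 m = as.numeric(tip_m),
                 ymin = as.numeric(tip_m - tip_se),
                 ymax = as.numeric(tip_m + tip_se))

# The error bar geom needs x, ymin and ymax; the points mark the means.
h_err <- ggplot(df, aes(x = x)) +
  geom_errorbar(aes(ymin = ymin, ymax = ymax), width = 0.2) +
  geom_point(aes(y = m))
```

Table sizes with only a few parties have wide intervals, because both the standard error and the t quantile grow as n shrinks.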
R> library(grid)
R> vplayout = function(x, y)
+ viewport(layout.pos.row = x, layout.pos.col = y)
Next we create a new page, with a 2 × 2 layout
R> grid.newpage()
R> pushViewport(viewport(layout = grid.layout(2, 2)))
Finally, we add the individual graphics. The plot created using the h
object is placed on the first row and spans both columns:
R> print(h, vp = vplayout(1, 1:2))
The other figures are placed on the second row (figure 4.6):
R> print(h1, vp = vplayout(2, 1))
R> print(h2, vp = vplayout(2, 2))
5
Reshaping data
2. Melt the remaining columns into a single column; use the id argu-
ment to identify the protected columns
which gives
R> pat_comb
Gene variable value
1 A Patient1 10.1
2 B Patient1 9.1
3 C Patient1 5.6
4 A Patient2 15.2
5 B Patient2 10.2
6 C Patient2 4.8
7 A Patient3 20.5
8 B Patient3 8.7
9 C Patient3 5.1
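The melt step that produces this output can be sketched as follows. The wide data frame is rebuilt from the values printed above; its name, pat_wide, is hypothetical because the original object is not shown.

```r
library(reshape2)

# Wide format: one row per gene, one column per patient.
# Values are taken from the pat_comb output; the name pat_wide is hypothetical.
pat_wide <- data.frame(Gene = c("A", "B", "C"),
                       Patient1 = c(10.1, 9.1, 5.6),
                       Patient2 = c(15.2, 10.2, 4.8),
                       Patient3 = c(20.5, 8.7, 5.1))

# Protect the Gene column and melt the rest into variable/value pairs.
pat_comb <- melt(pat_wide, id.vars = "Gene")
```

In this long format each row is a single observation, which is the layout ggplot2 expects: for example, variable can be mapped to x, value to y and Gene to colour.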
The examples in the notes are generated using the following R setup:
R> version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 2
minor 14.2
year 2012
month 02
day 29
svn rev 58522
language R
version.string R version 2.14.2 (2012-02-29)
To obtain the version number of a particular package, use
R> packageDescription("ggplot2")$Version
[1] "0.9.0"
The packages used in these notes are given in table 6.1.