03.Graphics in R


DR COLIN S. GILLESPIE

ADVANCED R GRAPHICS

NEWCASTLE UNIVERSITY
Contents

1 Background
2 ggplot2 overview
3 Plot building
4 A few other things
5 Reshaping data
6 R setup
Bibliography
“If I can’t picture it, I can’t understand it.”
Albert Einstein.

“The greatest value of a picture is when it forces us to
notice what we never expected to see.”
John Tukey.
1 Background

1.1 Installing packages

Installing packages in R is straightforward. To install a package from the
command line we use the install.packages command. For example,
R> install.packages("ggplot2")
R> library(ggplot2)
For this course, the packages we use are listed in chapter 6. To
update packages to their latest versions, we use the update.packages()
command. However, you may need root access to update all packages.
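
As a small sketch of this workflow, the helper below reports which of a set of packages still need installing. The function name missing_packages is hypothetical (it is not part of R), and the package list is only an assumption based on the packages mentioned in these notes. The install call is left commented out so that running the block has no side effects.

```r
# Hypothetical helper: return the packages in `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Packages mentioned in these notes (an assumed list; see chapter 6).
course_pkgs <- c("ggplot2", "reshape2", "plyr")
to_install <- missing_packages(course_pkgs)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```
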

1.2 Types of R graphics

1.2.1 Base graphics


Base graphics were written by Ross Ihaka based on his experience of
implementing the S graphics driver. If you have created a histogram,
scatter plot or boxplot, you’ve probably used base graphics. Base
graphics are generally fast, but have limited scope. You can only
draw on top of the plot; you cannot edit or alter existing
graphics. For example, if you combine the plot and points commands,
you have to work out the x- and y-limits before adding the points.
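
To make the last point concrete, here is a minimal base-graphics sketch. It uses the built-in mtcars data, an assumption made so that the example runs without extra packages (the course itself uses mpg), and computes the axis limits from all of the data before the first plot call.

```r
# Built-in data set, used here only for illustration.
four  <- mtcars[mtcars$cyl == 4, ]
eight <- mtcars[mtcars$cyl == 8, ]

# Work out limits that cover every group *before* drawing anything;
# points() cannot enlarge the plotting region afterwards.
x_lim <- range(mtcars$disp)
y_lim <- range(mtcars$mpg)

plot(four$disp, four$mpg, xlim = x_lim, ylim = y_lim,
     xlab = "Displacement", ylab = "Miles per gallon")
points(eight$disp, eight$mpg, col = 4)
```
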

1.2.2 Grid graphics


Grid graphics were developed by Paul Murrell [1]. Grid grobs (graphical
objects) can be represented independently of the plot and modified later.
The viewports system makes it easier to construct complex plots. Grid
doesn’t provide tools for graphics; it provides primitives for creating
plots. Lattice and ggplot2 graphics use grid.

[1] P Murrell. R Graphics. CRC Press, 2 edition, 2011.
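
The idea that a grob exists independently of the plot and can be modified later can be sketched with the grid package itself. The object names circle and red_circle are illustrative only.

```r
library(grid)

# Create a grob without drawing it.
circle <- circleGrob(r = 0.3, name = "circle")

# Modify the stored object later: editGrob returns an altered copy.
red_circle <- editGrob(circle, gp = gpar(col = "red", lwd = 3))

# Draw the modified grob on a fresh page.
grid.newpage()
grid.draw(red_circle)
```
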

1.2.3 Lattice graphics


The lattice package uses grid graphics to implement the trellis graphics
system [2]. It produces nicer plots than base graphics and legends are
automatically generated. I initially used lattice, but I found it a bit
confusing and so switched to ggplot2.

[2] D Sarkar. Lattice: Multivariate Data Visualization with R (Use R!).
Springer, 1st edition, 2008.

Table 1.1: The last five cars in the mpg dataset. The variables cty and hwy
record miles per gallon for city and highway driving respectively. The
variable displ is the engine displacement in litres.

manufacturer  model   displ  year  cyl  trans       cty  hwy  class
volkswagen    passat  2.0    2008  4    auto(s6)    19   28   midsize
volkswagen    passat  2.0    2008  4    manual(m6)  21   29   midsize
volkswagen    passat  2.8    1999  6    auto(l5)    16   26   midsize
volkswagen    passat  2.8    1999  6    manual(m5)  18   26   midsize
volkswagen    passat  3.6    2008  6    auto(s6)    17   26   midsize

1.2.4 ggplot2 graphics


ggplot2 started in 2005 [3] and follows the “Grammar of Graphics” [4]. Like
lattice, ggplot2 uses grid to draw graphics, which means you can
exercise low-level control over the plot appearance.

[3] H Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer, New
York, 2009. ISBN 978-0-387-98140-6.
[4] We’ll come on to that later.

1.3 Data sets


Throughout the course, we will use a few different datasets.

1.3.1 Fuel economy data


This dataset includes car make, model, class, engine size and fuel
economy for a selection of US cars in 1999 and 2008. It is included
with the ggplot2 package [5] and is loaded using the data function:
R> library(ggplot2)
R> data(mpg)
Table 1.1 gives the last five cars in this data set.

[5] The data originally comes from the EPA fuel economy website,
http://fueleconomy.gov

1.3.2 The tips data set


A single waiter recorded information about each tip he received over a
few months while working in a particular restaurant. He collected data
on several variables:

• tip ($),
• bill ($),
• gender of the bill payer,
• whether there were smokers in the party,
• day of the week [6],
• time of day,
• party size.

[6] The waiter only worked Thursdays, Fridays, Saturdays and Sundays.

There were a total of 244 tips. The first few rows of this data set are
shown in table 1.2. The data comes with the reshape2 package and is
loaded using the data function:
R> library(reshape2)
R> data(tips)
Table 1.2: The first five rows of the tips data set. There are 244 rows in
this data set.

total_bill  tip   sex     smoker  day  time    size
16.99       1.01  Female  No      Sun  Dinner  2
10.34       1.66  Male    No      Sun  Dinner  3
21.01       3.50  Male    No      Sun  Dinner  3
23.68       3.31  Male    No      Sun  Dinner  2
24.59       3.61  Female  No      Sun  Dinner  4

1.3.3 Movie data set


The Internet Movie Database [7] is a website devoted to collecting movie
data supplied by studios and fans. It claims to be the biggest movie
database on the web and is run by Amazon. More information about
IMDB can be found online at

http://imdb.com/help/show_leaf?about

including information about the data collection process

http://imdb.com/help/show_leaf?infosource

IMDB makes their raw data available at http://uk.imdb.com/interfaces/.

Example rows are given in table 1.3. This data set contains information
on over 50,000 movies. We will use this dataset to illustrate the concepts
covered in this class. This is the full version of the data set used in the
Introduction to R course.

[7] http://imdb.com/

The dataset contains the following fields:

• Title. Title of the movie.
• Year. Year of release.
• Budget. Total budget in US dollars. If the budget isn’t known, then
  it is stored as ‘-1’.
• Length. Length in minutes.
• Rating. Average IMDB user rating.
• Votes. Number of IMDB users who rated this movie.
• r1: Multiplying by ten gives the percentage (to the nearest 10%) of
  users who rated this movie a 1.
• r2 – r10: Similar to r1.
• mpaa. The MPAA rating - PG, PG-13, R, NC-17.
• Action, Animation, Comedy, Drama, Documentary, Romance, Short.
  Binary variables representing whether the movie was classified as
  belonging to that genre. A movie can belong to more than one genre.
  See for example the film Ablaze in table 1.3.
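
Because unknown budgets are coded as -1, summaries of the Budget field are misleading unless that code is handled first. A minimal sketch, using a small made-up data frame (films) in place of the movie data:

```r
# Toy stand-in for the movie data; the values are illustrative only.
films <- data.frame(
  Title  = c("A", "B", "C"),
  Budget = c(-1, 45000000, 25000000)
)

# Recode the "unknown" marker as a missing value.
films$Budget[films$Budget == -1] <- NA

# Summaries now need na.rm = TRUE, but are no longer dragged down by -1.
mean_budget <- mean(films$Budget, na.rm = TRUE)
print(mean_budget)
```
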

This data set is part of the ggplot2 package (from ggplot2 2.0.0 onwards
it is shipped separately, in the ggplot2movies package):


R> library(ggplot2)
R> data(movies)
Table 1.3: Sample rows of the movie data set. Credit: This data set was
initially constructed by Hadley Wickham at http://had.co.nz/.

                                        Voting statistics                      Movie genre
Title           Year  Length  Budget    Rating  Votes  r1    ...  r10   mpaa   Action  Animation  Comedy  Drama  Documentary  Romance  Short
A.k.a. Cassius  1970  85      -1        5.7     43     4.5   ...  14.5  PG     0       0          0       0      1            0        0
AKA             2002  123     -1        6.0     335    24.5  ...  14.5  R      0       0          0       1      0            0        0
Alien Vs. Pred  2004  102     45000000  5.4     14651  4.5   ...  4.5   PG-13  1       0          0       0      0            0        0
Abandon         2002  99      25000000  4.7     2364   4.5   ...  4.5   PG-13  0       0          0       1      0            0        0
Abendland       1999  146     -1        5.0     46     14.5  ...  24.5  R      0       0          0       0      0            0        0
Aberration      1997  93      -1        4.8     149    14.5  ...  4.5   R      0       0          0       0      0            0        0
Abilene         1999  104     -1        4.9     42     0.0   ...  24.5  PG     0       0          0       1      0            0        0
Ablaze          2001  97      -1        3.6     98     24.5  ...  14.5  R      1       0          0       1      0            0        0
Abominable Dr   1971  94      -1        6.7     1547   4.5   ...  14.5  PG-13  0       0          0       0      0            0        0
About Adam      2000  105     -1        6.4     1303   4.5   ...  4.5   R      0       0          1       0      0            1        0
2 ggplot2 overview

ggplot2 is a bit different from other graphics packages. It roughly
follows the philosophy of Wilkinson, 1999 [1]. Essentially, we think about
plots as layers. By thinking of graphics in terms of layers it is easier
for the user to iteratively add new components and for a developer to
add new functionality.

[1] L Wilkinson. The Grammar of Graphics. Springer, 1st edition, 1999.

2.1 A basic plot using base graphics


A reasonable first attempt at analysing this data would be to produce a
scatter plot of (for example), engine displacement against city miles per
gallon. To use base graphics, we would first construct a basic scatter
plot of the data where the cylinder size is 4 [2]:
R> plot(mpg[mpg$cyl==4,]$displ,
+ mpg[mpg$cyl==4,]$cty,
+ xlim=c(1,8), ylim=c(5,35))
Next we add in the other cars corresponding to different cylinder sizes:
R> points(mpg[mpg$cyl==5,]$displ, mpg[mpg$cyl==5,]$cty,
+ col=2)
R> points(mpg[mpg$cyl==6,]$displ, mpg[mpg$cyl==6,]$cty,
+ col=3)
R> points(mpg[mpg$cyl==8,]$displ, mpg[mpg$cyl==8,]$cty,
+ col=4)
This would produce figure 2.1. A few points to note:

• We have to manually set the scales in the plot command using xlim
  and ylim.
• We haven’t created a legend. We would need to use the legend
  function.
• The default axis labels are terrible - mpg[mpg$cyl==4,]$displ
• If we wanted to look at highway miles per gallon, this is a bit of a
  pain.

[2] We’ve cheated here and pretended that we know the x- and y- limits.

Figure 2.1: A scatter plot of engine displacement vs average city miles per
gallon. The coloured points correspond to different cylinder sizes. The plot
was constructed using base graphics.

Let’s now consider the equivalent ggplot2 graphic - figure 2.2. After
loading the necessary library, the plot is generated using the following
code:
R> g = ggplot(data=mpg, aes(x=displ, y=cty))
R> g + geom_point(aes(colour=factor(cyl)))
The ggplot2 code is fundamentally different from the base code.
The ggplot function sets the default data set, and attributes called
aesthetics. The aesthetics are properties that are perceived on the
graphic. A particular aesthetic can be mapped to a variable or set to
a constant value. In figure 2.2, the variable displ is mapped to the
x-axis and the cty variable is mapped to the y-axis.
The other function, geom_point, adds a layer to the plot. The x and
y variables are inherited (in this case) from the first function, ggplot,
and the colour aesthetic is mapped to the cyl variable. Other possible
aesthetics are, for example, size, shape and transparency. In figure 2.2
these additional aesthetics are left at their default value.

Figure 2.2: As figure 2.1, but created using ggplot2.

Table 2.1: Basic geom’s and their corresponding standard plot names.

Plot Name        Geom       Base graphic
Barchart         bar        barplot
Box-and-whisker  boxplot    boxplot
Histogram        histogram  hist
Line plot        line       plot and lines
Scatter plot     point      plot and points

This approach is very powerful and enables us to easily create
complex graphics. For example, we could create a plot where the size
of the points depends on an additional factor:
R> p = g + geom_point(aes(size=factor(cyl)))
which gives figure 2.3, or we could create a line chart
R> p = g + geom_line(
+ aes(colour=factor(cyl), size = factor(cyl)))
to get figure 2.4. Of course, figures 2.3 and 2.4 aren’t particularly good
plots; they just illustrate the general idea.

Figure 2.3: As figure 2.2, but where the size aesthetic depends on cylinder
size.

Points, bars and lines are all examples of geom’s or geometric
objects. Typically, if we use a single geom, we get a standard plot.
Table 2.1 summarises some standard geoms and their equivalent base
graphic counterpart.
However, using the idea of a graphical grammar, we can construct
more complicated plots. For example, this code
R> p = g + geom_point(aes(colour=factor(cyl))) +
+ stat_smooth(aes(colour=factor(cyl)))
produces figure 2.5, which doesn’t really have a simple name.

Figure 2.4: As figure 2.2, but using geom_line.

In each ggplot2 command, we are adding (multiple) layers. A single
layer comprises four elements:

• an aesthetic and data mapping;
• a statistical transformation (stat);
• a geometric object (geom);
• and a position adjustment, i.e. how should objects that overlap be
  handled.

Figure 2.5: As figure 2.2, but with loess regression lines.

When we use the command


R> g + geom_point(aes(colour=factor(cyl)))
this is actually a shortcut for the command:
R> g + layer(
+ data = mpg,#inherited
+ mapping = aes(color=factor(cyl)),#x,y are inherited
+ stat = "identity",
+ geom = "point",
+ position = "identity"
+ )
In practice, we never use the layer function. Instead, we use

• geom_* which creates a layer with a specific geom (and various
  defaults including a stat) and/or
• stat_* which creates a layer with a specific stat (and various defaults
  including a geom) or
• qplot which creates a ggplot and a layer.

qplot is short for quick plot. I don’t cover qplot in this course. If you find
yourself using ggplot2 a lot, then it is worth the time investment.
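
One way to see the defaults that a geom_* call fills in is to inspect the layer object it returns. The sketch below assumes ggplot2 2.0 or later, where a layer stores its geom, stat and position as objects. The field names (geom, stat, position) are an implementation detail rather than a documented interface, so they may differ in other versions.

```r
library(ggplot2)

# geom_point() builds a layer; no data or plot is needed to inspect it.
point_layer <- geom_point()

# Each component of the layer is stored as an object whose class names it.
class(point_layer$geom)[1]      # the geom
class(point_layer$stat)[1]      # the default stat
class(point_layer$position)[1]  # the default position adjustment
```
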
3 Plot building

3.1 The basic plot object


To create an initial ggplot object, we use the ggplot() function. This
function has two arguments:

• data and
• an aesthetic mapping.

These arguments set up the defaults for the various layers that are added
to the plot and can be empty. For each plot layer, these arguments can
be overwritten. The data argument is straightforward - it is a data
frame [1]. The mapping argument creates default aesthetic attributes.
For example
R> g = ggplot(data=mpg,
+ mapping=aes(x=displ, y=cty, colour=factor(cyl)))
or equivalently,
R> g = ggplot(mpg, aes(displ, cty, colour=factor(cyl)))
The above commands don’t actually produce anything to be displayed;
we need to add layers for that to happen.

[1] ggplot2 is very strict regarding the data argument. It doesn’t accept
matrices or vectors. The underlying philosophy is that ggplot2 takes care of
plotting, rather than massaging the data into other forms. If you want to do
some data manipulation, then use other tools.
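
The sketch below checks this directly: a freshly created ggplot object carries no layers until one is added. It assumes that the layers of a plot can be read with `$layers`, which works in current ggplot2 releases but is an internal detail rather than a documented guarantee.

```r
library(ggplot2)

# Defaults only: data and aesthetic mapping, no layers yet.
g <- ggplot(mpg, aes(displ, cty, colour = factor(cyl)))
n_before <- length(g$layers)

# Adding a geom appends one layer, which inherits the defaults.
g_points <- g + geom_point()
n_after <- length(g_points$layers)

c(before = n_before, after = n_after)
```
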

3.2 Geometric objects


geom’s or geometric objects are used to perform the actual rendering
in a plot. For example, we have already seen that a line geom will
create a line plot and a point geom creates a scatter plot. Each geom
has a list of aesthetics that it expects [2]. However, some geoms have
unique elements. The error bar geom requires arguments ymax and
ymin. Table 3.1 gives some standard geoms [3].

[2] For example, x, y, colour and size.
[3] For a full list, see table 4.2 of the ggplot2 book or online at
http://had.co.nz/ggplot2/.

3.2.1 Example: combining geoms

Let’s look at the tips data set - see §1.3.2 for a description. We begin
by creating a base ggplot object
R> g = ggplot(tips, aes(x=size, y=tip))
Remember, the above piece of code doesn’t do anything. Now we’ll
create a boxplot using the boxplot geom:
R> g1 = g + geom_boxplot()
This produces figure 3.1. Notice that the default axis labels are the
column headings of the associated data frame. Figure 3.1 is a boxplot
of all the tips data; a more useful plot would be to have individual
boxplots conditional on table size
R> g2 = g + geom_boxplot(aes(group=size))
Notice that we have added a group aesthetic to the boxplot geom.
Many geom’s have this aesthetic. For example, if we used geom_line,
then we would have individual lines for each size - this doesn’t make
much sense in this scenario.

Table 3.1: A few standard geom’s in ggplot2.

Name       Description
abline     Line, specified by slope and intercept
boxplot    Box and whiskers plot
density    Kernel density plot
density2d  Contours from a 2d density estimate
histogram  Histograms
jitter     Individual points are jittered to avoid overlap
smooth     Add a smoothed conditional mean
step       Connect observations by stairs

Figure 3.1: A boxplot of tips earned by the waiter.

We are not restricted to a single geom - we can add multiple geoms.
When data sets are reasonably small, it is useful to display the data on
top of the boxplots:
R> ##We need to jitter the points to avoid overlap
R> ##We colour the points depending on whether the
R> ##person is a smoker
R> g3 = g2 +
+ geom_jitter(position=position_jitter(width=0.3),
+ aes(colour=smoker))
This generates figure 3.3. Since the points would all fall on straight
lines, we use the jitter geom to wiggle the points about their axis. We
also colour the points conditional on whether someone at the table
smoked using the colour aesthetic.

Figure 3.2: Boxplots of tips, conditional on table size.
Figure 3.3: As figure 3.2, but including the data points.

3.3 Standard plots

There are a few standard geom’s that are particularly useful:

• geom_boxplot: produces a boxplot - see figure 3.1.
• geom_point: a scatter plot - see figure 3.3.
• geom_bar: produces a standard barplot that counts the x values.
  For example, to generate a bar plot in figure 3.4 of the MPAA ratings
  in the movie data set, we use the following code:
R> h = ggplot(movies, aes(x=mpaa)) +
+ geom_bar()
• geom_line: a line plot - see practical 3.
• geom_text: adds labels to specified points. This has an additional
  (required) aesthetic: label. Other useful aesthetics, such as hjust
  and vjust, control the horizontal and vertical position. The angle
  aesthetic controls the text angle.
• geom_raster: Similar to levelplot or image. For example,
R> set.seed(1)
R> example = expand.grid(x=1:10, y=1:10)
R> example$z = runif(100)
R> ggplot(example, aes(x, y)) + geom_raster(aes(fill=z))
  generates figure 3.5. If the squares are unequal, then use the (slower)
  geom_tile function.

Figure 3.4: A bar chart of the MPAA rating.
Figure 3.5: A heatmap of some example data using geom_raster. New to
version 0.9.

3.4 Aesthetics

The key to successfully using aesthetics is remembering that the aes()
function maps data to an aesthetic. If the parameter is not data or is
constant, then don’t put it in an aesthetic. Only parameters that are
inside of an aes() will appear in the legend. To illustrate these ideas,
we’ll generate a simple scatter-plot:
R> d = data.frame(x=1:50, y = 1:50, z = 0:9)
R> g_aes = ggplot(d, aes(x = x, y = y))
R> g_aes + geom_point(aes(colour = z))
which gives figure 3.6. Here the z variable has been mapped to the
colour aesthetic. Since this parameter is continuous, ggplot2 uses
a continuous colour palette. Alternatively, if we make z a factor or a
character, ggplot2 uses a different colour palette:
R> g_aes + geom_point(aes(colour=factor(z)))
to get figure 3.7. If we put a constant value inside aes() (figure 3.8)
R> g_aes + geom_point(aes(colour="Blue"))
the resulting plot is unlikely to be what we intended. The value ‘Blue’
is just treated as a standard factor. Instead, you probably wanted
R> g_aes + geom_point(colour="Blue")
Another important point is that when you specify mappings inside
ggplot(aes()), these mappings are inherited by every subsequent
layer. This is fine for x and y, but can cause trouble for other aesthetics.
For example, using the colour aesthetic is fine for geom_line, but may
not be suitable for geom_text.

Figure 3.6: Illustration of the continuous colour aesthetic.
Figure 3.7: Illustration of the discrete colour aesthetic.
Figure 3.8: Illustration of a constant colour aesthetic.
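
The difference between mapping a constant inside aes() and setting it outside can be checked without drawing anything. The sketch below assumes a ggplot2 version that provides layer_data() (2.0 or later), which returns the data a layer would draw. The names mapped and fixed are illustrative.

```r
library(ggplot2)

d <- data.frame(x = 1:50, y = 1:50, z = 0:9)
g_aes <- ggplot(d, aes(x = x, y = y))

# Mapping: "Blue" is treated as data and passed through the colour scale.
mapped <- layer_data(g_aes + geom_point(aes(colour = "Blue")))

# Setting: the constant is used as the colour itself.
fixed <- layer_data(g_aes + geom_point(colour = "blue"))

unique(mapped$colour)
unique(fixed$colour)
```

In the mapped case the colour column holds whatever the default discrete palette assigns to a single level, not blue. In the fixed case every point is blue.
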

Table 3.2: Standard aesthetics. Individual geom’s may have other aesthetics.
For example, geom_text uses label and geom_boxplot has, amongst
other things, upper.

Aesthetic  Description
linetype   Similar to lty in base graphics
colour     Similar to col in base graphics
size       Similar to cex in base graphics
fill       See figure 3.5.
shape      Glyph choice
alpha      Control the transparency

There are a few standard aesthetics that appear in most, but not
all, geom’s and stat’s (see table 3.2). Individual geom’s can have
additional optional and required aesthetics. See their help file for
further information.

3.5 Statistical transformations


Statistical transformations, or stat’s, transform the data. For example,
in figure 2.5 we use a loess smoother function (conditional on the
number of cylinders) to plot the overall data trend. Remember, all
geoms have stats and, vice versa, all stats have geoms.
A stat takes a dataset as input and returns a dataset as an output.
For example, the boxplot stat [4] takes in a data set and produces the
following variables:

• lower
• upper
• middle
• ymin: bottom (vertical minimum)
• ymax: top (vertical maximum).

[4] Used by the boxplot geom.

Typically, these statistics are used by the boxplot geom. Equally, they
could be used by the error bar geom.
A widely used stat is identity. This stat does not alter the underlying
data and is used by a number of geoms, such as geom_point and
geom_line.
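
The data set returned by the boxplot stat can be inspected directly. This is a minimal sketch on made-up data (the toy data frame and its column names are assumptions), using layer_data() from ggplot2 2.0 or later.

```r
library(ggplot2)

# Toy data: one group with the values 1 to 9.
toy <- data.frame(group = "a", value = 1:9)

# layer_data() returns the data set produced by the layer's stat.
box_stats <- layer_data(ggplot(toy, aes(x = group, y = value)) + geom_boxplot())

box_stats[, c("ymin", "lower", "middle", "upper", "ymax")]
```
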

3.5.1 Example: combining stats


Perhaps the easiest stat to consider is the stat_summary function.
This function summarises y values at every unique x value. This is
quite handy, for example, when adding single points that summarise
the data or adding error bars.
Table 3.3: Standard stat’s in ggplot2.

Name       Description                                         Comment
bin        Bin data                                            histogram
boxplot    Calculates the components of box-and-whisker plots  See geom_boxplot
contour    Contours of 3d data
density    1d density estimation
density2d  2d density estimation
function   Superimpose a function
identity   Leave the data untouched                            Used in most geoms
qq         Calculation for q-q plots
quantile   Continuous quantiles
smooth     Add a smoother
spoke      Convert angle and radius to xend and yend
step       Create stair steps                                  See geom_step
sum        Sum unique values
summary    Summarises y values at every unique x
unique     Remove duplicates

Figure 3.9: Average tip amount conditional on table size.

A simple plot to create is the mean tip amount based on table size,
figure 3.9:
R> g4 = g + stat_summary(geom="point", fun.y= mean)
In the above piece of code we calculate the mean tip for each unique
x value, that is, for different table sizes. These x-y values are passed to
the point geom. We can use any function for fun.y provided it takes
in a vector and returns a single value. For example, we could calculate
the ratio of the mean and median, as in figure 3.10:
R> g5 = g + stat_summary(geom="point",
+ fun.y= function(i) mean(i)/median(i))
As with the geom example, we can combine multiple stats:
R> g6 = g +
+ stat_summary(fun.ymin = function(i) quantile(i, 0.05),
+ fun.ymax = function(i) quantile(i, 0.95),
+ colour = "blue", geom="errorbar",
+ width=0.2) +
+ stat_smooth(aes(colour=smoker, lty=smoker),
+ se=FALSE, method="lm")
Using the stat_summary function, we have created error bars that
span the 5% to 95% quantile range. The stat_smooth function plots the
regression lines, conditional on whether someone on the table smokes -
figure 3.11.

Figure 3.10: The ratio of the mean to median tip amount conditional on
table size.
Figure 3.11: The 5% to 95% quantile range of the tip amount displayed using
error bars. The stat_smooth function is used to add OLS regression lines,
conditional on whether anyone in the party smoked.
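
In more recent ggplot2 releases (3.3.0 onwards) the arguments fun.y, fun.ymin and fun.ymax were renamed fun, fun.min and fun.max. The sketch below shows the same idea with the newer names. It uses a small made-up data frame (toy) rather than the tips data, so the numbers are illustrative only.

```r
library(ggplot2)

# Made-up data: two table sizes with two tips each.
toy <- data.frame(size = c(1, 1, 2, 2), tip = c(1, 3, 2, 6))

# Mean tip at each unique x value, drawn as points.
p_mean <- ggplot(toy, aes(x = size, y = tip)) +
  stat_summary(geom = "point", fun = mean)

# The summarised data passed to the point geom.
summary_data <- layer_data(p_mean)
summary_data[, c("x", "y")]
```
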

3.6 Facets
Faceting is a mechanism for automatically laying out multiple plots on
a page. The data is split into subsets, with each subset plotted onto a
different panel. ggplot2 has two types of faceting:

• facet_grid: produces a 2d panel of plots where variables define
  rows and columns.
• facet_wrap: produces a 1d ribbon of panels which can be wrapped
  into 2d.

3.6.1 Facet grid

The function facet_grid lays out the plots in a 2d grid. The faceting
formula specifies the variables that appear in the columns and rows.
Suppose we are interested in movie length. A first plot we could
generate is a basic histogram:
R> g = ggplot(movies, aes(x=length)) + xlim(0, 200)
R> g + geom_histogram(binwidth=3)

Figure 3.12: A histogram of movie length.
This produces figure 3.12. Notice that we have altered the x-axis since
there are a couple of outlying films and adjusted the binwidth in the
histogram. The data is clearly bimodal. Some movies are fairly short,
whilst others have an average length of around one hundred minutes.
We will now use faceting to explore the data further.

• y ~ .: a single column with multiple rows. This can be handy
  for double column journals. For example, to create histograms
  conditional on whether they are comedy films, we use:
R> g + geom_histogram(aes(y=..density..), binwidth=3) +
+ facet_grid(Comedy ~ .)
  This gives figure 3.13. Since there are many more non-comedy than
  comedy films, we use the density in the histogram (look at the
  y-axis).

Figure 3.13: Movie length conditional on whether it is a comedy.

• . ~ x: a single row with multiple columns. Very useful on wide
  screen monitors. In this piece of code, we create kernel density plots,
  conditional on whether the movie was animated:
R> g + geom_density(aes(y=..density..)) +
+ facet_grid(. ~ Animation)
  From figure 3.14, it’s clear that the majority of animations are short
  films. For illustration purposes, we have used the geom_density
  function in figure 3.14.
• y ~ x: multiple rows and columns. Typically the variable with the
  greatest number of levels is used for the columns. We can also add
  marginal plots when using facet_grid. By default, margins=FALSE.
R> g + geom_histogram(aes(y=..density..), binwidth=3) +
+ facet_grid(Comedy ~ Animation, margins=TRUE)

Figure 3.14: Density plots of movie length conditional on animation.
Figure 3.15: Movie length conditional on comedy and animation status.
Marginal histograms are along the bottom row and the right hand column.

Figure 3.15 splits movie length by comedy and animation. Since we
set margins=TRUE, we also have the marginal plots. Notice that the
plot in the bottom right corner is the same as figure 3.12.

The panel labels aren’t that helpful - they are either 0 or 1. By
default ggplot2 uses the values set in the data frame. Typically I
use more descriptive names in my data frame so the default is more
appropriate.

3.6.2 Controlling facet scales


For both facet_grid and facet_wrap we can allow the scale to be the
same in all panels (fixed) or vary between panels. This is controlled by
the scales parameter in the facet_* function:

• scales = "fixed": x and y scales are fixed across all panels (default).
• scales = "free": x and y scales vary across all panels.
• scales = "free_x": the x scale is free.
• scales = "free_y": the y scale is free.

We will experiment with these in the practical session.
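
As a minimal sketch of the scales argument, the code below facets a made-up data frame (toy, with three groups on very different y ranges) and frees only the y scale. The data and object names are assumptions chosen for illustration.

```r
library(ggplot2)

# Made-up data: three groups on very different y ranges.
toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 4),
  x     = rep(1:4, times = 3),
  y     = c(1:4, 10 * (1:4), 100 * (1:4))
)

# Each panel gets its own y scale; the x scale stays shared.
p_free <- ggplot(toy, aes(x, y)) +
  geom_point() +
  facet_wrap(~ group, scales = "free_y")

# One panel per group.
n_panels <- length(unique(layer_data(p_free)$PANEL))
n_panels
```

With scales = "fixed" the points in group a would be squashed against the bottom of their panel. Freeing the y scale lets each panel use its own range, at the cost of making heights harder to compare across panels.
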


Figure 3.16: Movie length conditional on the decade the movie was created.

Table 3.4: Standard scales in ggplot2. In this table, replace * with either
scale_x or scale_y. Common arguments are breaks, labels, na.value,
trans and limits. See the help files for further details.

Function           Description
*_continuous(...)  Main scale function.
*_log10(...)       log10 transformation.
*_reverse(...)     Reverse the axis.
*_sqrt(...)        The square root transformation.
*_datetime(...)    Precise control over dates and times.
*_discrete(...)    Not usually needed - see §6.3 of Wickham, 2009.

3.6.3 Facet wrap


The facet_wrap function creates a 1d ribbon of plots. This can be
quite handy when trying to save space. To illustrate, let’s examine
movie length by decade. First, we create a new variable for the movie
decade [5]:
R> movies$decade = round_any(movies$year, 10, floor)
Then to generate the ribbon of histograms, we use the
facet_wrap function:
R> ggplot(movies, aes(x=length)) + geom_histogram() +
+ facet_wrap( ~ decade, ncol=6) + xlim(0, 200)
As before, we truncate the x-axis. Since we have counts on the y-axis,
we notice that the number of movies made has increased through time.
Also, shorter movies were popular in the 1950s and 1960s.

[5] The function round_any is part of the plyr package.
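
If plyr is not available, the same decade variable can be computed with base R alone. The helper name to_decade below is hypothetical.

```r
# Round each year down to the start of its decade.
to_decade <- function(year) {
  floor(year / 10) * 10
}

to_decade(c(1894, 1999, 2000))
```
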

3.7 Axis Scales

When we create complex plots involving multiple layers, ggplot2 uses
an iterative process to calculate the correct scales. For example, if
in figure 3.11 we only plotted the regression lines, ggplot2 would
reduce the y-axis scale. We can set the scales explicitly using the xlim
and ylim functions. However, if we use these functions, any data that
falls outside of the plotting region isn’t plotted and isn’t used in
statistical transformations, for example when calculating the
binwidth in histograms. If you want to zoom into a plot region, then use
coord_cartesian(xlim = c(..,..)) instead.
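
The difference between limiting a scale and zooming with coord_cartesian can be checked on a tiny made-up data set. The sketch assumes ggplot2 3.3.0 or later (for the fun argument of stat_summary) and uses layer_data() to read back the summarised values. All object names are illustrative.

```r
library(ggplot2)

# Made-up data: one x value, with one large y value.
toy <- data.frame(x = 1, y = c(1, 2, 3, 100))
p <- ggplot(toy, aes(x, y)) + stat_summary(geom = "point", fun = mean)

# ylim() discards y = 100 before the mean is computed (with a warning).
mean_limited <- layer_data(p + ylim(0, 10))$y

# coord_cartesian() only zooms; the mean still uses all four values.
mean_zoomed <- layer_data(p + coord_cartesian(ylim = c(0, 10)))$y

c(limited = mean_limited, zoomed = mean_zoomed)
```
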

At times, we may want to transform the data. A standard example is


the log transformation. Suppose we wanted to create a scatter plot of 0

5.0e+07 1.0e+08 1.5e+08 2.0e+08


budget
length against budget. We remove any movies that have a zero budget
Figure 3.17: Scatter plot of movie bud-
or length. Then we could use the following commands get against length.
R> h = ggplot(subset(movies, length>0 & budget>0),
+ aes(y=length)) + ylim(0, 500) 500

R> h1 = h + geom_point(aes(budget), alpha=0.2)


400

to get figure 3.17. Notice that we have changed the alpha transparency
value to help with over plotting. 300

To plot the log budgets, there are two possibilities. First, we could
transform the data
R> h2 = h + geom_point(aes(log10(budget)), alpha=0.2)
to get figure 3.18.

Figure 3.18: Scatter plot of movie log10(budget) against length.

Note that ylim(0, 500) is shorthand for
scale_y_continuous(limits=c(0, 500)). Alternatively, we can transform
the scale:
R> h3 = h1 + scale_x_log10()
R> ##Or equivalently
R> h1 + scale_x_continuous(trans="log10")
to get figure 3.19. Figures 3.18 and 3.19 show the same pattern of points,
but in figure 3.19 the axis is still labelled in the original units. To
generate figure 3.19 we used scale_x_log10(), which is a convenience
function for scale_x_continuous(trans="log10"). Some standard scale
transformations are given in table 3.4. As an aside, the scale functions
are fundamentally different from geoms, since they don’t add a layer
to the plot.
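One way to check that the two approaches place the points in the same positions is to compare the x positions that ggplot2 computes for each plot. The sketch below uses a small made-up data frame in place of the movies data, and layer_data() to extract the computed positions.

```r
library(ggplot2)

# Made-up budgets and lengths, standing in for the movies data
d = data.frame(budget = c(1e3, 1e5, 1e7), length = c(90, 100, 120))

# Take logs inside aes(): the axis is labelled in log10 units
p_aes = ggplot(d, aes(x = log10(budget), y = length)) + geom_point()

# Use scale_x_log10(): the axis is labelled in the original units
p_scale = ggplot(d, aes(x = budget, y = length)) + geom_point() +
  scale_x_log10()

# The positions used for drawing agree
all.equal(layer_data(p_aes)$x, layer_data(p_scale)$x)
```

Only the axis labelling differs between the two plots, which is why the choice is mostly a matter of how you want the axis to read.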
Figure 3.19: Scatter plot of movie budget against length, with the budget
scale transformed.

The scale_* functions can also adjust the tick marks and labels.
For example,
R> h4 = h3 +
+ scale_y_continuous(breaks=seq(0, 500, 100),
+ limits=c(0, 500),
+ minor_breaks=seq(0, 500, 25),
+ labels=c(0, "", "", "", "", 500),
+ name="Movie Length")
gives figure 3.20. If you just want to change the x-axis limits or name,
then you can use the convenience functions xlim and xlab. There are
similar functions for the y-axis.

Figure 3.20: Scatter plot of movie budget against length. Using
scale_y_continuous gives us more control of tick marks and grid lines.

The above description of axis scales is based on what happened
in version 0.8.9. However, version 0.9 seems to be slightly different, but
isn’t yet finalised. In particular, in version 0.9 the default grid lines when
using a log transformation don’t appear as a regular grid.

3.8 Other topics


There are a few topics that I have skipped, mainly due to space and
time.

• Themes: if you want to make consistent changes to all your plots, say
  to reduce the font size, then you should use themes. One useful theme is
  theme_bw(). This can be set globally using theme_set(theme_bw())
  or added to a single plot using the standard notation: + theme_bw().

• Coordinate systems: unlike transforming data or scales, transforming
  the coordinate system transforms the appearance of the geoms. For
  example, in polar coordinates a rectangle becomes a doughnut; in a map
  projection, the shortest path will no longer be a straight line. See
  §7.3 of the ggplot2 book for further details.

• Multiple plots: this includes having sub-figures on top of larger
  figures or multiple plots on a single page. See §8.4 in the ggplot2
  book.

• Legend manipulation: changing legend titles and positions.

• Maps: there is also a geom_map for plotting maps. However, I haven’t
  really used this in earnest. There is also a ggmap package that might
  be worth looking at.
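As a rough illustration of the themes and legend items above, the sketch below sets a global theme and then changes a legend title and position. It assumes a reasonably recent version of ggplot2, where theme() and labs() are available (these notes were written for version 0.9.0, so the calls may differ there), and it uses the built-in mtcars data set.

```r
library(ggplot2)

# Set a black and white theme with a smaller base font for later plots;
# theme_set() returns the previous theme so that it can be restored
old_theme = theme_set(theme_bw(base_size = 10))

g = ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(colour = "Cylinders") +         # legend title
  theme(legend.position = "bottom")    # legend position

# The global theme is applied when the plot is drawn
print(g)

# Restore the previous global theme
theme_set(old_theme)
```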
4
A few other things

4.1 The dot geom

You can think of a dot plot as a one-dimensional scatter plot, where tied
values are perturbed. There are two basic algorithms for generating a
dot plot.

1. “dotdensity” (the default): the bin positions are determined by the
   data and the bin width.

2. “histodot”: has regular spacing between stacks, as in a histogram.

Figure 4.1: A dot plot of mpg using geom_dotplot. This is the default dotplot,
using the dotdensity method.

The dots in a dot plot can be manipulated in a variety of ways:

1. Changing the size of a dot.

2. Stacking the dots in different ways (see the stackdir argument).

3. Altering the closeness of dots (see the stackratio argument).

To create a dot plot, we use the dotplot geom:
R> ##default dotdensity method
R> g = ggplot(mtcars, aes(x = mpg))
R> g + geom_dotplot(binwidth = 1.5)
to create figure 4.1. The binwidth argument sets the (maximum) width of
the bins, in the units of the x-axis, and hence the size of the dots. The
other standard method for constructing dot plots is the histodot method:
R> g + geom_dotplot(method="histodot", binwidth = 1.5)
to get figure 4.2.

Figure 4.2: A dot plot of mpg using geom_dotplot. This plot was constructed
using the histodot method.
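The arguments mentioned above can be combined in a single call. The following sketch, again using mtcars, changes the dot size, stacks the dots around a centre line and spreads the stacks out slightly; the particular values are arbitrary and only meant as a starting point.

```r
library(ggplot2)

g = ggplot(mtcars, aes(x = mpg))

# dotsize:    diameter of the dots, relative to binwidth
# stackdir:   direction in which the dots are stacked
# stackratio: how closely the dots are stacked (1 means just touching)
g_dots = g + geom_dotplot(binwidth = 1.5, dotsize = 0.8,
                          stackdir = "center", stackratio = 1.2)
print(g_dots)
```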
Dot plots are particularly useful when combined with boxplots. Using
the tips data set again, we get
R> h = ggplot(tips, aes(x=size, y=tip)) +
+ geom_boxplot(aes(group=size)) +
+ geom_dotplot(aes(group=size),
+ binaxis="y", stackdir="center",
+ colour="blue", fill="blue",
+ binwidth=0.05, stackratio=0.5)
to get figure 4.3.

Figure 4.3: Box- and dot-plots of the tips data set conditional on table size.

4.2 The error bar geom

The error bar geom provides a mechanism for adding error bars to your
plot. Three closely related geoms are:

• geom_pointrange: range indicated by a straight line, with a point in
  the middle.

• geom_linerange: range indicated by a straight line.

• geom_errorbarh: a horizontal geom_errorbar.

The error bar geom has three required aesthetics: x, ymin and ymax.
Suppose we wanted to create a graphic where the error bars are the
mean tip size ± two standard errors. First, we calculate some
summary statistics
R> tip_m = tapply(tips$tip, tips$size, mean)
R> tip_sd = tapply(tips$tip, tips$size, sd)
R> tip_l = tapply(tips$tip, tips$size, length)
Next we create a vector containing the standard errors multiplied by
the corresponding t quantile, i.e.

$$t_{n-1} \frac{s}{\sqrt{n}}$$

to get
R> tip_se = qt(0.975, tip_l - 1) * tip_sd/sqrt(tip_l)
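As a quick sanity check of the formula, the interval it gives can be compared with the one returned by t.test. The sketch below uses a small made-up vector of tips rather than the tips data set.

```r
# Made-up tip amounts for a single group
x = c(2.0, 3.5, 3.0, 4.5, 2.5)
n = length(x)

# Half-width of the 95% interval: t quantile with n - 1 degrees of
# freedom, multiplied by the standard error of the mean
half_width = qt(0.975, df = n - 1) * sd(x) / sqrt(n)
ci = mean(x) + c(-1, 1) * half_width

# t.test() constructs the same interval
all.equal(ci, as.numeric(t.test(x)$conf.int))
```

Note that the degrees of freedom are n − 1, not n; the difference matters most for the small groups.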

Then we put this data into a data frame
R> df = data.frame(x = 1:6,
+ ymin = tip_m - tip_se,
+ ymax = tip_m + tip_se,
+ m = tip_m)
Notice that we prepare the data before attempting to use ggplot2.
Remember, ggplot2 doesn’t try to manipulate the data, that’s up to
you! Now that the data is in the correct form, we can apply the errorbar
geom (figure 4.4):
R> h1 = ggplot(df) +
+ geom_errorbar(aes(x=x, ymin=ymin, ymax=ymax))

Figure 4.4: A 95% confidence interval for the mean tip amount, conditional
on group size.

We could go a step further and add a bar plot layer
R> h2 = ggplot(df) +
+ geom_errorbar(aes(x=x, ymin=ymin, ymax=ymax)) +
+ geom_bar(aes(x=x, y=m), stat="identity")
to get figure 4.5. (Personally, I really dislike these plots. See for
example http://goo.gl/RvAaK and http://goo.gl/jGTUs.)

Figure 4.5: As figure 4.4, but with a bar plot layer (aka dynamite plot).

If we want to add a dot to the error bar to represent the mean or
median, then we just use geom_point to create an additional layer.
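A minimal sketch of this extra layer is given below. The data frame is made up, but has the same columns (x, ymin, ymax and m) as the one constructed above. The closely related geom_pointrange draws the range and the point in a single layer.

```r
library(ggplot2)

# Made-up summary data with the same columns as df
d = data.frame(x = 1:3, m = c(2.0, 3.0, 2.5),
               ymin = c(1.5, 2.2, 2.0), ymax = c(2.5, 3.8, 3.0))

# Error bars with the mean added as a separate point layer
p_err = ggplot(d, aes(x = x)) +
  geom_errorbar(aes(ymin = ymin, ymax = ymax), width = 0.2) +
  geom_point(aes(y = m))

# The same information using a single geom_pointrange layer
p_range = ggplot(d, aes(x = x, y = m, ymin = ymin, ymax = ymax)) +
  geom_pointrange()
```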
Figure 4.6: An example plot using viewports. The top plot spans two
columns.

4.3 Multiple plots


When we want to create a figure in base graphics that contains multiple
plots, we use the par function. For example, to create a 2 × 2 plot, we
would use
R> par(mfrow=c(2, 2))
In ggplot2, we can do something similar. Using the gridExtra package,
we have
R> library(gridExtra)
R> grid.arrange(g1, g2, g3, g4, nrow=2)
where g1, g2, g3 and g4 are standard ggplot2 graph objects.
An alternative way of creating figure grids is to use viewports. Using
viewports gives you more flexibility, but is more complicated. First,
we load the grid package and create a convenience function

R> library(grid)
R> vplayout = function(x, y)
+ viewport(layout.pos.row = x, layout.pos.col = y)
Next we create a new page, with a 2 × 2 layout
R> grid.newpage()
R> pushViewport(viewport(layout = grid.layout(2, 2)))
Finally, we add the individual graphics. The plot created using the h
object is placed on the first row and spans both columns:
R> print(h, vp = vplayout(1, 1:2))
The other figures are placed on the second row (figure 4.6):
R> print(h1, vp = vplayout(2, 1))
R> print(h2, vp = vplayout(2, 2))
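Putting the viewport steps together, a self-contained sketch might look as follows. The three plots are simple stand-ins built from mtcars, not the figures used above, and the final popViewport() call is an optional tidy-up step.

```r
library(ggplot2)
library(grid)

# Convenience function: a viewport at a given row/column of the layout
vplayout = function(x, y)
  viewport(layout.pos.row = x, layout.pos.col = y)

# Three stand-in plots
p_top = ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p_left = ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 2)
p_right = ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()

# New page with a 2 x 2 layout
grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 2)))

# Top plot spans both columns; the others share the second row
print(p_top, vp = vplayout(1, 1:2))
print(p_left, vp = vplayout(2, 1))
print(p_right, vp = vplayout(2, 2))

# Return to the top-level viewport when finished
popViewport()
```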
5
Reshaping data

A common problem is that we receive data that requires restructuring.


In this chapter we will look at common ways of restructuring data to
make it more amenable to R manipulation.

5.1 The melt function

Sometimes a single variable is found in multiple columns. For example,
consider table 5.1.

Table 5.1: Some example patient data. Patients 1–3 are columns in the table.

          Patient
Gene     1      2      3
A     10.1   15.2   20.5
B      9.1   10.2    8.7
C      5.6    4.8    5.1

The patient variable is spread across three columns. After inputting
the data into R, we have the following data frame
R> patient
  Gene Patient1 Patient2 Patient3
1    A     10.1     15.2     20.5
2    B      9.1     10.2      8.7
3    C      5.6      4.8      5.1
To (easily) combine the patients into a single column we use the melt
function, which is part of the reshape2 package. (The melt function also
works with lists and arrays. You can also specify how it should handle
missing values.) Using the melt function is a three-step process:

1. Identify the columns that we want untransformed. In this example,
   we want to protect the column Gene.

2. Melt the remaining columns into a single column; use the id argument
   to identify the protected columns

R> pat_comb = melt(patient, id=c("Gene"))

which gives
R> pat_comb
  Gene variable value
1    A Patient1  10.1
2    B Patient1   9.1
3    C Patient1   5.6
4    A Patient2  15.2
5    B Patient2  10.2
6    C Patient2   4.8
7    A Patient3  20.5
8    B Patient3   8.7
9    C Patient3   5.1

3. Rename the columns (if appropriate).
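The three steps can be sketched end to end as follows. The patient data frame is rebuilt from table 5.1 so that the example is self-contained; the new column names Patient and Expression are made up for illustration. The second call shows that melt can name the new columns directly through its variable.name and value.name arguments.

```r
library(reshape2)

# Rebuild the data from table 5.1
patient = data.frame(Gene = c("A", "B", "C"),
                     Patient1 = c(10.1, 9.1, 5.6),
                     Patient2 = c(15.2, 10.2, 4.8),
                     Patient3 = c(20.5, 8.7, 5.1))

# Steps 1 and 2: protect Gene, melt the remaining columns
pat_comb = melt(patient, id = c("Gene"))

# Step 3: rename the columns (names chosen for illustration only)
colnames(pat_comb) = c("Gene", "Patient", "Expression")

# Alternatively, name the new columns while melting
pat_comb2 = melt(patient, id.vars = "Gene",
                 variable.name = "Patient", value.name = "Expression")
```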


6
R setup

The examples in the notes are generated using the following R setup:
R> version
               _
platform       x86_64-pc-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status
major          2
minor          14.2
year           2012
month          02
day            29
svn rev        58522
language       R
version.string R version 2.14.2 (2012-02-29)
To obtain the version number of a particular package, use
R> packageDescription("ggplot2")$Version
[1] "0.9.0"
The packages used in these notes are given in table 6.1.

Table 6.1: List of packages used in these notes.

Package     Version
ggplot2     0.9.0
grid        2.14.2
gridExtra   0.9
hexbin      1.26.0
reshape2    1.2.1
scales      0.2.0
Note that version 0.9 of ggplot2 introduced a number of new features
that weren’t available in previous versions.
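A table like table 6.1 can be generated rather than typed by hand. The sketch below uses packageVersion() from the utils package; to keep it runnable on any installation it only queries packages that ship with R, so the package list is an assumption and should be replaced with the packages you actually use.

```r
# Packages to report on (these ship with R; substitute your own list)
pkgs = c("grid", "stats", "utils")

# Look up the installed version of each package
versions = sapply(pkgs, function(p) as.character(packageVersion(p)))

pkg_table = data.frame(Package = pkgs, Version = versions,
                       row.names = NULL)
print(pkg_table)

# Versions can also be compared, e.g. to check for a minimum version
packageVersion("grid") >= "2.14.2"
```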
Bibliography

P Murrell. R Graphics. CRC Press, 2nd edition, 2011.

D Sarkar. Lattice: Multivariate Data Visualization with R (Use R!).
Springer, 1st edition, 2008.

H Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer,
New York, 2009. ISBN 978-0-387-98140-6.

L Wilkinson. The Grammar of Graphics. Springer, 1st edition, 1999.
