PRACTICUM, Day 1: R Graphing: Basic Plotting and ggplot2
CRG Bioinformatics Unit, sarah.bonnin@crg.eu
May 6th, 2016
Contents
Introduction
  Packages
  Dataset
Basic plotting
  Scatter plots: plot()
    Exercise 1
  Histogram: hist()
    Exercise 2
Introduction
In this tutorial, you will learn how to use some R functions and parameters to represent your data. We will
mainly focus on producing and customizing graphs using the ggplot2 R package.
Packages
If R does not find the packages automatically, you can specify a package repository:
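The installation command itself is not shown here. A minimal sketch, assuming the CRAN cloud mirror as the repository (any other mirror works the same way) and installing only the packages that are still missing:

```r
# Packages used in this tutorial
pkgs <- c("ggplot2", "RColorBrewer", "gridExtra", "gplots", "VennDiagram")

# Keep only the packages that are not installed yet
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Install them from an explicit repository (the mirror URL is an assumption)
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}
```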
Once the packages are installed, you can load them in your current R session:
library("ggplot2")
library("RColorBrewer")
library("gridExtra")
library("gplots")
library("VennDiagram")
Dataset
We will use the diamonds dataset from the ggplot2 package (added automatically into the global environment
when ggplot2 is loaded). This dataset contains the prices and other attributes of almost 54,000 diamonds.
It is interesting for us because it contains plenty of discrete and continuous data that we will be able to use:
head(diamonds)
Basic plotting
Basic plotting functions in R include (but are not limited to):
Parameters can be added to each of these functions to control sizes, colors, labels, lines, and more.
We will see a few examples of scatter plots and histograms using the basic R functions.
Scatter plots: plot()
A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display
values for typically two variables for a set of data. (Wikipedia)
plot(x, y)
## carat price
## 1 0.23 326
## 2 0.21 326
## 3 0.23 327
## 4 0.29 334
## 5 0.31 335
## 6 0.24 336
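The output above looks like the first rows of a two-column data frame named diamonds1, but the command that built it is not shown. A minimal sketch, assuming diamonds1 simply holds the carat and price columns of diamonds:

```r
library(ggplot2)  # provides the diamonds dataset

# Assumption: diamonds1 keeps only the carat and price columns
diamonds1 <- as.data.frame(diamonds[, c("carat", "price")])

# Show the first 6 rows
head(diamonds1)
```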
If a data frame is given as input, plot() uses all columns and creates every possible pairwise scatter plot.
With only two columns, as in diamonds1, the command:
plot(diamonds1)
gives the same plot as:
plot(diamonds1$carat, diamonds1$price)
[Figure: scatter plot of price (y axis) against carat (x axis)]
• Add a title
• Change x and y axis labels
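Both changes are made with parameters of plot(). A minimal sketch; the title text is only an example:

```r
library(ggplot2)  # only needed for the diamonds dataset

plot(diamonds$carat, diamonds$price,
     main = "Price of diamonds by carat",  # plot title (example text)
     xlab = "Carat",                       # x axis label
     ylab = "Price")                       # y axis label
```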
[Figure: the same scatter plot with a title and custom axis labels (x axis: Carat)]
Parameters col, cex and pch can be used to modify point color, point size and point shape, respectively.
POINT TYPES
BASIC COLORS
A vector of 657 colors is available by default:
colors()
You can for example take the first 10 values from that vector:
colors()[1:10]
Or select 10 colors at random with sample():
?sample
# sample(x, size=n): randomly select n elements from x
sample(colors(), 10)
Exercise 1
Modify the previous scatter plot to obtain the following one:
[Figure: target scatter plot for Exercise 1 (x axis: Carat)]
Parameters to change:
• pch
• col
• cex
Histogram: hist()
A histogram is a plot that lets you discover, and show, the underlying frequency distribution of a
set of continuous data. (Wikipedia)
hist(diamonds$price)
[Figure: histogram titled "Histogram of diamonds$price" (x axis: diamonds$price, y axis: Frequency)]
We can change bar color, x and y labels, and add a title to the graph the same way as with the plot function.
hist(diamonds$price, col="blue",
xlab="Prices of diamonds",
ylab="Frequency",
main="Frequency of diamond prices")
[Figure: blue histogram titled "Frequency of diamond prices" (x axis: Prices of diamonds, y axis: Frequency)]
Bins are the intervals into which the range of values is divided; each bin is drawn as one bar.
The higher the number of bins, the higher the histogram’s resolution.
We can change more parameters, for example the number of bins (breaks) or whether the axes are drawn (axes):
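The number of bins is set with the breaks parameter of hist(). A minimal sketch comparing two settings without drawing anything (plot=FALSE returns the computed bins instead). Note that a single number given to breaks is only a suggestion: hist() rounds the bin limits to "pretty" values, so the actual number of bins can differ.

```r
library(ggplot2)  # only needed for the diamonds dataset

# Compute the histograms without plotting them
h_coarse <- hist(diamonds$price, breaks = 10, plot = FALSE)
h_fine   <- hist(diamonds$price, breaks = 100, plot = FALSE)

# Number of bins actually used in each case
length(h_coarse$counts)
length(h_fine$counts)
```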
[Figure: histogram "Frequency of diamond prices" (x axis: Prices of diamonds, y axis: Frequency)]
We can add vertical or horizontal lines. Here we want to display a green vertical line to show the average price.
The abline function is called after hist to add the line: it is a separate function call, not a parameter of hist.
hist(diamonds$price, col="blue",
xlab="Prices of diamonds",
ylab="Frequency",
main="Frequency of diamond prices",
breaks=30,
axes=FALSE)
# abline: "v" specifies the x-value(s) for vertical line(s).
abline(v=mean(diamonds$price), col="green")
[Figure: histogram "Frequency of diamond prices" with a green vertical line at the average price]
Exercise 2
Produce the following histogram:
[Figure: target histogram for Exercise 2 (x axis: Prices of diamonds)]
LINE TYPES in R
Plotting with ggplot2
Graphing package inspired by the Grammar of Graphics seminal work of Leland Wilkinson. A
tool that enables us to concisely describe the components of a graphic.
Why ggplot2?
• More flexible
• More customizable
• Prettier
• Easy to modify
• Well documented.
Getting started
Every ggplot2 graph is built from the same few components:
• a data set.
• a coordinate system.
• a set of geoms: visual marks that represent the data points.
Once the plot is started, you can add layers and additional elements to customize the plot.
Layer elements are added with +.
Starting a plot: “base” layer:
# One variable
a <- ggplot(dataframe, aes(x))
# Two variables
a <- ggplot(dataframe, aes(x, y))
The aes function generates aesthetic mappings that describe how variables in the data are mapped to visual
properties (aesthetics) of geoms.
A few simple examples of the possible geoms that can be mapped:
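The list of geoms is not reproduced here. As an illustration, the sketch below adds three common geoms to a base layer built from the diamonds dataset; the object names are hypothetical.

```r
library(ggplot2)

# Two variables: one point per diamond
p_points <- ggplot(diamonds, aes(x = carat, y = price)) + geom_point()

# One continuous variable: histogram
p_hist <- ggplot(diamonds, aes(x = price)) + geom_histogram(bins = 30)

# One discrete variable: bar plot of counts
p_bar <- ggplot(diamonds, aes(x = cut)) + geom_bar()

# Printing the object draws the plot
p_points
```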
Barplots: geom_bar()
A bar chart or bar graph is a chart that presents grouped data with rectangular bars with lengths
proportional to the values that they represent. (Wikipedia)
We will start with the most basic barplot using the geom_bar() geom, representing the different diamond
cuts:
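The plotting commands are not shown here. A minimal sketch, using hypothetical object names: geom_bar() counts the rows in each category by default, and mapping fill to cut colors each bar and adds a legend.

```r
library(ggplot2)

# Most basic bar plot: number of diamonds per cut
bar_plain <- ggplot(diamonds, aes(x = cut)) + geom_bar()

# Same plot, with each bar filled according to its cut
bar_filled <- ggplot(diamonds, aes(x = cut, fill = cut)) + geom_bar()

bar_filled
```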
[Figure: bar plot of diamond counts per cut]
[Figure: the same bar plot with bars filled by cut (legend: cut)]
We might want to change the default color scheme. There are plenty of color schemes in R.
Palettes available by default
The following functions are pre-made palettes that create vectors of n different colors:
rainbow(n)
heat.colors(n)
terrain.colors(n)
topo.colors(n)
cm.colors(n)
For example:
rainbow(n=10)
Additional palettes
One example is the package RColorBrewer that provides many color schemes for graphics:
# Check what the main function offers
?brewer.pal
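A short sketch of how RColorBrewer is typically used: display.brewer.all() draws an overview of the palettes, and brewer.pal() returns n colors from a named palette. The palette chosen here ("Set1") is only an example.

```r
library(RColorBrewer)

# Draw an overview of all available palettes
display.brewer.all()

# Extract 5 colors from the "Set1" palette (returned as hexadecimal codes)
set1_cols <- brewer.pal(n = 5, name = "Set1")
set1_cols
```

Such a vector can then be given to the values parameter of a scale_fill_manual layer.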
Exercise 3
Modify barp1 to obtain the following plot (barp2)
[Figure: target bar plot for Exercise 3, bars filled by cut with a custom palette]
Using:
• scale_fill_manual layer
• topo.colors palette
We can next add a title to the plot:
[Figure: the bar plot with a title added]
The diamonds dataset gives additional information about the color of each diamond (diamonds$color).
We can color each bar according to the proportion of each color within each type of cut:
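A minimal sketch of such a plot, with a hypothetical object name: mapping fill to color splits each bar into stacked segments, one per diamond color, and ggtitle() adds the title.

```r
library(ggplot2)

# Bars per cut, stacked by diamond color, with a title
bar_colors <- ggplot(diamonds, aes(x = cut, fill = color)) +
  geom_bar() +
  ggtitle("Types of diamond cuts, and diamonds colors")

bar_colors
```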
[Figure: stacked bar plot titled "Types of diamond cuts, and diamonds colors", bars filled by color (D to J)]
A pie chart is a type of graph in which a circle is divided into sectors that each represent a
proportion of the whole. (Wikipedia)
# table: builds a contingency table of the counts at each combination of factor levels.
cuts_counts <- as.data.frame(table(diamonds$cut))
cuts_counts
## Var1 Freq
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
[Figure: single stacked bar of Freq, filled by Var1 (Fair, Good, Very Good, Premium, Ideal)]
The conversion into a pie chart is done by transforming the coordinates to the polar coordinate system (most
commonly used for pie charts):
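The two steps can be sketched as follows, with hypothetical object names: first a single stacked bar built from the counts table, then the same plot in polar coordinates, where theta="y" maps the bar heights to angles.

```r
library(ggplot2)

# Counts per cut, as in the table above
cuts_counts <- as.data.frame(table(diamonds$cut))

# Step 1: one stacked bar; stat="identity" uses Freq as the bar height
stacked <- ggplot(cuts_counts, aes(x = 1, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity")

# Step 2: polar coordinates turn the stacked bar into a pie
pie_plot <- stacked + coord_polar(theta = "y")

pie_plot
```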
[Figure: pie chart of Freq, filled by Var1]
Change the default colors to the ones of your choice, the same way as we did for the barplot:
[Figure: the same pie chart with custom colors]
Note that a pie chart is also quick and easy to produce with the basic plotting function pie().
It is simply done with:
pie(table(diamonds$cut))
[Figure: base R pie chart of diamond cuts (Fair, Good, Very Good, Premium, Ideal)]
Histograms: geom_histogram()
A histogram is a plot that lets you discover, and show, the underlying frequency distribution of a
set of continuous data. (Wikipedia)
Let’s start with the most basic histogram: we will display the diamonds price distribution:
[Figure: histogram of price (y axis: count)]
We can change the number of bins (30 by default) to get a higher resolution:
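A minimal sketch with hypothetical object names and example values: the bins parameter of geom_histogram() sets the number of bins, while binwidth sets the width of each bin in data units instead.

```r
library(ggplot2)

# Default histogram: 30 bins (ggplot2 prints a message suggesting to pick a value)
hist_default <- ggplot(diamonds, aes(x = price)) + geom_histogram()

# Higher resolution: 100 bins
hist_fine <- ggplot(diamonds, aes(x = price)) + geom_histogram(bins = 100)

# Alternative: fix the width of each bin (here 500 price units)
hist_width <- ggplot(diamonds, aes(x = price)) + geom_histogram(binwidth = 500)

hist_fine
```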
[Figure: histogram of price with more bins (y axis: count)]
Now let’s take into account the color of the diamond (J and D being the worst and best colors, respectively).
Add this parameter as a ggplot aesthetic:
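A minimal sketch, with a hypothetical object name and an example number of bins: mapping fill to color in aes() stacks the counts of each diamond color within every bin and adds a legend.

```r
library(ggplot2)

# Histogram of prices, each bin split by diamond color
hist_color <- ggplot(diamonds, aes(x = price, fill = color)) +
  geom_histogram(bins = 50)

hist_color
```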
[Figure: histogram of price, bars filled by color (D to J)]
Add a red vertical line at the median price value, and a blue vertical line at the average price value
using geom_vline layers:
# Median and mean of the diamond prices
median_price <- median(diamonds$price)
mean_price <- mean(diamonds$price)
# geom_vline() adds vertical lines: xintercept is their position on the x axis
histp3 <- histp2 +
  geom_vline(xintercept=median_price, linetype="dotdash", colour="red", size=0.5) +
  geom_vline(xintercept=mean_price, linetype="dotdash", colour="blue", size=0.5)
histp3
[Figure: histogram of price filled by color, with dot-dash vertical lines at the median (red) and the mean (blue)]
Exercise 4
Add text to histp3 that displays the mean and the median values
We want to obtain this plot:
[Figure: histp3 with the text labels 2401 (median) and 3932.8 (mean) next to the vertical lines]
The grid.arrange function from the gridExtra package displays several plots on one page.
With grid.arrange, plots are laid out on the page as a grid of nrow rows and ncol columns.
grid.arrange(histp1,
histp2,
histp3,
histp4,
ncol=2,
nrow=2)
[Figure: histp1, histp2, histp3 and histp4 arranged in a grid of 2 rows and 2 columns]
grid.arrange(histp2,
histp3,
histp4,
ncol=3,
nrow=1)
[Figure: histp2, histp3 and histp4 arranged side by side in one row]
Boxplots: geom_boxplot()
We will produce here boxplots of the distribution of diamond prices according to their colors.
Basic boxplot, colored given the diamond color:
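A minimal sketch of such a boxplot, with a hypothetical object name: color is on the x axis, price on the y axis, and fill colors each box according to the diamond color.

```r
library(ggplot2)

# One box per diamond color, filled by that color
box_basic <- ggplot(diamonds, aes(x = color, y = price, fill = color)) +
  geom_boxplot()

box_basic
```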
[Figure: boxplots of price per color (D to J), boxes filled by color]
For more clarity we might want to remove the outliers, reduce the y axis and add more breaks to it:
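One possible way to do this is sketched below; the object name and the exact limits are assumptions. Outliers are hidden with outlier.shape=NA, the y axis is zoomed with coord_cartesian(), and the breaks are set with scale_y_continuous().

```r
library(ggplot2)

box_clean <- ggplot(diamonds, aes(x = color, y = price, fill = color)) +
  geom_boxplot(outlier.shape = NA) +                      # do not draw outliers
  coord_cartesian(ylim = c(0, 10000)) +                   # zoom in on the y axis
  scale_y_continuous(breaks = seq(0, 10000, by = 1000))   # one break every 1000

box_clean
```

coord_cartesian() only zooms in on the plot. Setting limits in scale_y_continuous() or with ylim() would instead drop the values outside the range before the boxes are computed, which changes the boxes themselves.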
[Figure: boxplots of price per color without outliers, y axis from 1000 to 10000 with a break every 1000]
[Figure: the same boxplots with the order of the x axis reversed (J to D)]
We will next take into account the cut of the diamonds.
If that property is taken into account at the initial aesthetics level, each of the previous boxplots will be
divided into several boxplots representing the different diamond cuts:
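A minimal sketch with hypothetical object names: mapping fill to cut in the initial aes() call produces one box per cut within each color. Adding coord_flip() swaps the two axes, which gives horizontal boxes.

```r
library(ggplot2)

# One box per cut within each diamond color
box_cut <- ggplot(diamonds, aes(x = color, y = price, fill = cut)) +
  geom_boxplot()

# Same plot with the axes swapped
box_cut_flipped <- box_cut + coord_flip()

box_cut
```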
[Figure: boxplots of carat per color, split by cut]
[Figure: boxplots of price per color, split by cut]
[Figure: boxplots split by cut, with color on the vertical axis]
Exercise 5
Produce the following notched boxplot
[Figure: target notched boxplot for Exercise 5, with cut (Fair to Ideal) on the vertical axis and boxes filled by color (D to J)]
Notches are used to compare groups; if the notches of two boxes do not overlap, this is evidence that the
medians differ.
In addition to what we have already seen, you will need:
Scatter plots: geom_point()
A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display
values for typically two variables for a set of data. (Wikipedia)
We will use a random subset of 5000 diamonds (only for a less busy plot) for producing the scatter plots.
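The command that draws the subset is not shown. A minimal sketch, assuming the rows are picked with sample(); the object name and the seed are hypothetical, and set.seed() is only there to make the random selection reproducible.

```r
library(ggplot2)

set.seed(42)  # reproducible random selection (arbitrary seed)

# Randomly select 5000 row numbers, then keep those rows
diamonds_sub <- diamonds[sample(nrow(diamonds), 5000), ]

# Check the dimensions of the subset: rows, columns
dim(diamonds_sub)
```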
## [1] 5000 10
We will plot the carat versus price of each diamond. Each point in the following scatter plots represents a
single diamond.
[Figure: scatter plot of carat (y axis) against price]
Add information to the graph: color the points according to the diamond color:
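A minimal sketch of both steps, with hypothetical object names and a random subset drawn as an assumption: the basic scatter plot first, then the same plot with the colour aesthetic mapped to the diamond color.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, for a reproducible subset
diamonds_sub <- diamonds[sample(nrow(diamonds), 5000), ]

# Basic scatter plot: one point per diamond
scat_basic <- ggplot(diamonds_sub, aes(x = price, y = carat)) +
  geom_point()

# Points colored according to the diamond color (adds a legend)
scat_color <- ggplot(diamonds_sub, aes(x = price, y = carat, colour = color)) +
  geom_point()

scat_color
```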
[Figure: scatter plot of carat against price, points colored by color (D to J)]
[Figure: the same scatter plot with legends for both color and cut]
Note how legends are added as you add layers of information to the plot.
You might want to control the shape and size of the points (refer to the point shape table introduced in the
basic plotting):
[Figure: scatter plot of carat against price with modified point shape and size, and legends for color and cut]
We will use the RColorBrewer package previously introduced to color our plot using different palettes.
Palettes from that package are the following ones:
Blues BuGn BuPu GnBu Greens Greys Oranges OrRd PuBu PuBuGn PuRd Purples RdPu
Reds YlGn YlGnBu YlOrBr YlOrRd BrBG PiYG PRGn PuOr RdBu RdGy RdYlBu RdYlGn
Spectral Accent Dark2 Paired Pastel1 Pastel2 Set1 Set2 Set3
Exercise 6
Modify scatp3 and give the palettes a try to obtain the following plot (or any plot that you
like best):
[Figure: target scatter plot for Exercise 6: carat against price, with legends for color and cut]
We will now add a linear regression line to scatp4:
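A regression line is added with a geom_smooth layer and method="lm". The sketch below is self-contained and uses hypothetical object names rather than scatp4 itself. The colour mapping is placed inside geom_point() so that a single line is fitted to all points instead of one line per diamond color.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, for a reproducible subset
diamonds_sub <- diamonds[sample(nrow(diamonds), 5000), ]

scat_lm <- ggplot(diamonds_sub, aes(x = price, y = carat)) +
  geom_point(aes(colour = color)) +            # colour only applies to the points
  geom_smooth(method = "lm", formula = y ~ x)  # linear regression line with confidence band

scat_lm
```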
[Figure: scatter plot of carat against price with a linear regression line, and legends for color and cut]
Finally, we will reverse the y axis, and add a title to the plot:
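Both changes are extra layers: scale_y_reverse() flips the y axis and ggtitle() adds the title. A self-contained sketch with hypothetical object names:

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, for a reproducible subset
diamonds_sub <- diamonds[sample(nrow(diamonds), 5000), ]

scat_rev <- ggplot(diamonds_sub, aes(x = price, y = carat, colour = color)) +
  geom_point() +
  scale_y_reverse() +                                  # largest carat values at the bottom
  ggtitle("Carat, Price and Cut of 5000 diamonds")     # plot title

scat_rev
```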
[Figure: scatter plot titled "Carat, Price and Cut of 5000 diamonds", with the carat axis reversed and legends for color and cut]
We can save scatter plots to a jpeg file and increase the dimensions of the output file (by
default, for jpeg/png/bmp/tiff formats, width=height=480 pixels; pdf dimensions are given in inches and default to 7):
The process is similar for saving graphics in pdf, png, bmp or tiff formats:
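Saving always follows the same sequence: open a file device, draw the plot, close the device. A minimal sketch for the jpeg case; the plot object, the file name and the dimensions are hypothetical, and the file is written to a temporary directory.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = price, y = carat)) + geom_point()

# 1. Open a jpeg device with larger dimensions (in pixels)
out_file <- file.path(tempdir(), "scatter_example.jpg")
jpeg(out_file, width = 900, height = 700)

# 2. Draw the plot into the file (print is needed inside scripts and functions)
print(p)

# 3. Close the device: the file is only complete after this call
dev.off()

file.exists(out_file)
```

For the other formats, replace jpeg() with png(), bmp(), tiff() or pdf(); for pdf(), width and height are expressed in inches.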
dev.off()
Heatmaps: heatmap.2()
# seq(from=1, to=2, by=0.5): generate sequence from 1 to 2 with 0.5 steps (1.0 1.5 2.0)
# sample(x, size=n): randomly select n elements from x
# paste(x, y, sep="_"): pastes x, y (as many elements as needed) together,
# separated here with "_"
in_heat <- matrix(sample(seq(0, 20, 0.001), 400),
nrow=50,
ncol=8,
dimnames=list(1:50, c(paste("S", 1:4, sep=""), paste("T", 1:4, sep=""))))
heatmap.2(in_heat)
[Figure: heatmap of in_heat with row and column dendrograms and a "Color Key and Histogram" legend]
We can change the color scheme to green-red or blue-red:
heatmap.2(in_heat, col="greenred")
[Figure: the same heatmap with the green-red color scheme]
heatmap.2(in_heat, col="bluered")
[Figure: the same heatmap with the blue-red color scheme]
Now imagine that you are analyzing an experiment, and samples T (1 to 4) come from a different
experimental group than samples S (1 to 4).
Apart from the names, it would be useful to mark the experimental groups with colored boxes, especially
for large experiments.
# color vector
mycol <- colnames(in_heat)
# grep(pattern, x): search for matches to pattern within each element of x
mycol[grep("T", mycol)] <- "green"
mycol[grep("S", mycol)] <- "orange"
# ColSideColors: add colored boxes.
heatmap.2(in_heat, col="bluered",
ColSideColors=mycol)
[Figure: blue-red heatmap with colored boxes above the columns (green for T samples, orange for S samples)]
By default, heatmap.2 uses hierarchical clustering to reorder the rows and columns of the input matrix.
Clustering can be turned off for the rows, the columns, or both, so as to keep the matrix in its input order:
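A minimal sketch for the columns, assuming an input matrix built like in_heat above (the seed and the name hm are hypothetical): Colv=FALSE keeps the columns in their input order, and dendrogram="row" draws only the row dendrogram.

```r
library(gplots)

set.seed(1)  # arbitrary seed, for a reproducible matrix
in_heat <- matrix(sample(seq(0, 20, 0.001), 400),
                  nrow = 50,
                  ncol = 8,
                  dimnames = list(1:50, c(paste("S", 1:4, sep = ""), paste("T", 1:4, sep = ""))))

# Columns are not clustered; rows still are
hm <- heatmap.2(in_heat, Colv = FALSE, dendrogram = "row")

# Column order used in the plot: unchanged from the input
hm$colInd
```

Rows are handled the same way with Rowv=FALSE; to turn off both, use Rowv=FALSE, Colv=FALSE and dendrogram="none".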
[Figure: heatmap with columns kept in input order (S1 to S4, then T1 to T4)]
Exercise 7
Create the following heatmap:
[Figure: target heatmap for Exercise 7, titled "A pretty heatmap.2"]
• Have a look at the (very exhaustive) help page of the function: ?heatmap.2
– dendrogram
– Colv
– labCol and labRow
– srtCol
– trace
– key
– main
Venn Diagrams: venn.diagram()
A Venn diagram (also known as a set diagram or logic diagram) is a diagram that shows all
possible logical relations between a finite collection of different sets (Wikipedia).
## [1] 5 80 58 69 90 84 44 45 32 17 42 97 24 77 52 57 30
## [18] 50 37 9 2 89 29 28 56 15 4 34 60 14 18 94 39 91
## [35] 41 0 100 10 48 72
## [1] 57 21 76 22 81 40 14 54 38 31 89 5 58 73 59 67 12 25 29 26 19 9 10
## [24] 88 41 74 32 2 28 87
Function venn.diagram from package VennDiagram takes a list as an input. We will create a list with names
containing both vectors.
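The commands that create the vectors and the list are not shown. A minimal sketch, assuming both vectors are drawn with sample() in the same way as vec3 further down, with 40 and 30 elements as in the output above; the seed is arbitrary, so the values will differ from the ones printed in this document.

```r
set.seed(123)  # arbitrary seed, for reproducible vectors

# Two vectors of integers drawn at random between 0 and 100
vec1 <- sample(seq(0, 100, 1), 40)
vec2 <- sample(seq(0, 100, 1), 30)

# Named list: each component will be one circle of the Venn diagram
mylist <- list(Vector1 = vec1, Vector2 = vec2)
mylist
```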
## $Vector1
## [1] 5 80 58 69 90 84 44 45 32 17 42 97 24 77 52 57 30
## [18] 50 37 9 2 89 29 28 56 15 4 34 60 14 18 94 39 91
## [35] 41 0 100 10 48 72
##
## $Vector2
## [1] 57 21 76 22 81 40 14 54 38 31 89 5 58 73 59 67 12 25 29 26 19 9 10
## [24] 88 41 74 32 2 28 87
Let’s plot the most basic Venn diagram. Each component of the list represents a circle, and the plot will be
saved in the VennDiagram.tiff file in the current directory.
venn.diagram(mylist,
filename="VennDiagram.tiff")
This is a two-set Venn diagram.
venn.diagram works with lists of up to 5 vectors.
Let’s try a three-set Venn:
# Additional vector
vec3 <- sample(seq(0,100, 1), 35)
# new list containing the 3 vectors
mylist3 <- list(Vector1=vec1, Vector2=vec2, Vector3=vec3)
# Produce Venn
venn.diagram(mylist3,
filename="VennDiagram_3.tiff")
Exercise 8
Try to produce the following plot:
– cex
• And more:
– margin
– lty
– cat.dist
– main
END