Data Visualization With ggplot2
install.packages("tidyverse")
library(tidyverse)
mpg
Creating a ggplot
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
Exercises
1. Run ggplot(data = mpg). What do you see?
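A quick way to check this exercise, as a minimal sketch: build the plot object without any geom and inspect it. Reading the layers through `p$layers` is an assumption about the plot object's internals and may differ between ggplot2 versions.

```r
library(ggplot2)

# ggplot() alone sets up the data and coordinate system but adds no layers,
# so printing the object draws only an empty background panel.
p <- ggplot(data = mpg)
n_layers <- length(p$layers)  # 0: no geom has been added yet
print(p)
```

Nothing is drawn on the panel until a layer such as geom_point() is added.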
3. What does the drv variable describe? Read the help for ?mpg to find out.
The drv variable is a categorical variable that classifies cars as front-wheel,
rear-wheel, or four-wheel drive.
Aesthetic Mappings
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
Exercises
1. What’s gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
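The difference between mapping and setting an aesthetic can be sketched as follows. The object names `p_mapped` and `p_set` are illustrative only.

```r
library(ggplot2)

# Inside aes(), "blue" is treated as data: a constant value that ggplot2
# maps through the default discrete color scale, so the points are not blue.
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Outside aes(), color is a fixed setting applied to every point.
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

print(p_set)
```

Moving color = "blue" outside aes() is the usual fix when the intent is to color all points blue.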
2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the
documentation for the dataset.) How can you see this information when you run mpg?
The following list contains the categorical variables in mpg:
∙ manufacturer
∙ model
∙ trans
∙ drv
∙ fl
∙ class
The following list contains the continuous variables in mpg:
∙ displ
∙ year
∙ cyl
∙ cty
∙ hwy
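When mpg is printed, the type of each column appears under its name (for example <chr>, <int>, <dbl>). The same information can be pulled out programmatically; this is a minimal sketch, and the helper name `col_types` is illustrative.

```r
library(ggplot2)

# First class of every column: character columns are categorical,
# integer and double columns are numeric.
col_types <- vapply(mpg, function(col) class(col)[1], character(1))
col_types

# Names of the character (categorical) columns
categorical <- names(col_types)[col_types == "character"]
categorical
```

Note that numeric storage does not settle the question on its own: variables such as year and cyl take only a few distinct values and are sometimes better treated as categorical.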
3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for
categorical versus continuous variables?
The variable cty, city miles per gallon, is a continuous variable.
When mapped to color, it produces a continuous color gradient. When mapped to size, the sizes of the
points vary continuously as a function of cty.
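The color and size cases can be tried with a sketch like the one below, which follows the same pattern as the shape example. The object names are illustrative.

```r
library(ggplot2)

# Continuous variable mapped to color: ggplot2 uses a color gradient.
p_color <- ggplot(mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point()

# Continuous variable mapped to size: point size scales with the value.
p_size <- ggplot(mpg, aes(x = displ, y = hwy, size = cty)) +
  geom_point()

print(p_color)
print(p_size)
```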
ggplot(mpg, aes(x = displ, y = hwy, shape = cty)) +
geom_point()
#> Error: A continuous variable can not be mapped to shape
When a continuous value is mapped to shape, it gives an error. Though we could split a
continuous variable into discrete categories and use a shape aesthetic, this would
conceptually not make sense. A numeric variable has an order, but shapes do not. It is clear
that smaller points correspond to smaller values, or once the color scale is given, which
colors correspond to larger or smaller values. But it is not clear whether a square is greater
or less than a circle.
4. What happens if you map the same variable to multiple aesthetics?
ggplot(mpg, aes(x = displ, y = hwy, colour = hwy, size = displ)) +
geom_point()
5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point.)
Stroke changes the size of the border for shapes (21-25). These are filled shapes in which
the color and size of the border can differ from that of the filled interior of the shape.
For example,
ggplot(mtcars, aes(wt, mpg)) +
geom_point(shape = 21, colour = "black", fill = "white", size = 5, stroke = 5)
6. What happens if you map an aesthetic to something other than a variable name, like
aes(color = displ < 5)?
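A minimal sketch for trying this out: the expression displ < 5 is evaluated for every row, producing a logical value that ggplot2 treats as a discrete variable with the levels TRUE and FALSE.

```r
library(ggplot2)

# The expression inside aes() is computed from the data, so each point is
# colored by whether its displacement is below 5.
p <- ggplot(mpg, aes(x = displ, y = hwy, color = displ < 5)) +
  geom_point()

print(p)
```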
Facets
To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a
formula, which you create with ~ followed by a variable name (here “formula” is the name of a data
structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be
discrete:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
To facet your plot on the combination of two variables, add facet_grid() to your plot call. The first
argument of facet_grid() is also a formula. This time the formula should contain two variable names
separated by a ~:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
Exercises
1. What happens if you facet on a continuous variable?
The continuous variable is converted to a categorical variable, and the plot contains a facet
for each distinct value.
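As an illustration, faceting on cty (chosen here only as an example of a continuous variable) gives one panel per distinct value:

```r
library(ggplot2)

# Each distinct value of cty becomes its own facet column.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(. ~ cty)

# Number of panels drawn versus number of distinct cty values
n_panels <- length(unique(layer_data(p, 1)$PANEL))
n_values <- length(unique(mpg$cty))
print(p)
```

With many distinct values the panels become too narrow to read, which is why faceting variables are usually discrete.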
2. What do the empty cells in a plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
ggplot(data = mpg) + geom_point(mapping = aes(x = drv, y = cyl))
ggplot(data = mpg) +
geom_point(mapping = aes(x = hwy, y = cty)) +
facet_grid(drv ~ cyl)
The empty cells (facets) in this plot are combinations of drv and cyl that have no
observations. These are the same locations in the scatter plot of drv and cyl that have no
points.
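The same conclusion can be checked numerically. This sketch assumes dplyr is available (it is loaded by library(tidyverse)); the names `combos` and `n_possible` are illustrative.

```r
library(ggplot2)
library(dplyr)

# Combinations of drv and cyl that actually occur in the data
combos <- count(mpg, drv, cyl)
combos

# Every cell facet_grid(drv ~ cyl) lays out, observed or not
n_possible <- n_distinct(mpg$drv) * n_distinct(mpg$cyl)

# Cells with no observations show up as empty facets
n_empty <- n_possible - nrow(combos)
n_empty
```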
3. What plots does the following code make? What does . do?
The symbol . ignores that dimension when faceting. For example, drv ~ . facets by values of
drv on the y-axis.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
In contrast, . ~ cyl facets by values of cyl on the x-axis.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
6. When using facet_grid() you should usually put the variable with more unique levels in the columns.
Why?
Geometric Objects
A geom is the geometrical object that a plot uses to represent data. People often describe plots by the
type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms,
boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom.
geom_smooth() will draw a different line, with a different linetype, for each unique value of the
variable that you map to linetype:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, color = drv),
show.legend = FALSE
)
To display multiple geoms in the same plot, add multiple geom functions to ggplot():
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
The local data argument in geom_smooth() overrides the global data argument in ggplot() for that
layer only:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(
data = filter(mpg, class == "subcompact"),
se = FALSE
)
Exercises
1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
● line chart: geom_line()
● boxplot: geom_boxplot()
● histogram: geom_histogram()
● area chart: geom_area()
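Each of these geoms can be tried with a short sketch. The economics dataset, which ships with ggplot2, is used for the line and area charts as an assumption; it does not appear elsewhere in these notes, and the choice of variables and binwidth is illustrative.

```r
library(ggplot2)

# Line chart: unemployment over time
line_chart <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

# Boxplot: highway mileage by car class
box_plot <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# Histogram: distribution of highway mileage
histogram <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Area chart: same series as the line chart, filled down to zero
area_chart <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_area()

print(line_chart)
```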
2. Run this code in your head and predict what the output will look like. Then, run the code in R and
check your predictions:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
3. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it
earlier in the chapter?
The layer argument show.legend = FALSE hides the legend for that layer.
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, colour = drv),
show.legend = FALSE
)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
In that plot, there is no legend. Removing the show.legend argument or setting
show.legend = TRUE will result in the plot having a legend displaying the mapping
between colors and drv.
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, colour = drv))
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The legend is suppressed because with three plots, adding a legend to only the last
plot would make the sizes of plots different. Different sized plots would make it more
difficult to see how arguments change the appearance of the plots. The purpose of
those plots is to show the difference between no groups, using a group aesthetic, and
using a color aesthetic, which creates implicit groups. In that example, the legend
isn’t necessary since looking up the values associated with each color isn’t necessary
to make that point.
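4. What does the se argument to geom_smooth() do?
It controls whether a confidence band is drawn around the smoothed line. The band is shown by
default (se = TRUE); se = FALSE removes it.
5. Will these two graphs look different? Why or why not?
No. Both geom_point() and geom_smooth() use the same data and mappings. When those are given
in ggplot(), the layers inherit them, so they don’t need to be specified again in each layer,
as they are here: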
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
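For comparison, here is a sketch of the equivalent form in which the data and mappings are given once in ggplot() and inherited by both layers. The object names `p_global` and `p_local` are illustrative.

```r
library(ggplot2)

# Data and mappings set globally; both layers inherit them.
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# Data and mappings repeated in every layer.
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# The data computed for the point layer is the same in both plots.
same_points <- isTRUE(all.equal(layer_data(p_global, 1), layer_data(p_local, 1)))
same_points
```

The global form is shorter and avoids the risk of the two layers drifting apart when one mapping is edited and the other is not.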
6. Re-create the R code necessary to generate the following graphs