Data Visualization With Ggplot2

The document provides an overview of data visualization using the ggplot2 package in R, including how to create scatterplots, use aesthetic mappings, and apply faceting techniques. It covers exercises related to understanding the mpg dataset, the differences between categorical and continuous variables, and the use of various geometric objects in plots. Additionally, it discusses the implications of mapping aesthetics and the use of functions like geom_smooth and facet_wrap.

Uploaded by

121324030022
Data Visualization with ggplot2

install.packages("tidyverse")
library(tidyverse)

The mpg Data Frame

mpg

Creating a ggplot

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
Exercises
1. Run ggplot(data = mpg). What do you see?

2. How many rows are in mtcars? How many columns?
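One way to check the answer is to ask R directly. The sketch below uses the base R functions nrow(), ncol(), and dim() on the built-in mtcars data frame; the variable names n_rows and n_cols are only illustrative.

```r
# mtcars ships with base R (datasets package), so no extra packages are needed.
n_rows <- nrow(mtcars)   # number of observations (cars)
n_cols <- ncol(mtcars)   # number of variables
dim(mtcars)              # both at once: rows first, then columns
```

The same calls work on any data frame, including mpg once ggplot2 or the tidyverse is loaded.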

3. What does the drv variable describe? Read the help for ?mpg to find out.

The drv variable is a categorical variable that classifies cars as front-wheel,
rear-wheel, or four-wheel drive.

4. Make a scatterplot of hwy versus cyl.

ggplot(mpg, aes(x = cyl, y = hwy)) +
  geom_point()
5. What happens if you make a scatterplot of class versus drv? Why is the plot not useful?
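The resulting scatterplot has only a few points. Both class and drv are categorical, so there are
only a few distinct combinations of values, and observations that share a combination are drawn
on top of each other.

ggplot(mpg, aes(x = class, y = drv)) +
  geom_point()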
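The overplotting can be made visible by counting the distinct pairs of values. The following is a minimal sketch, assuming only that ggplot2 is installed; n_pairs, n_obs, and p_count are illustrative names, and geom_count() is one possible alternative display, not the one the exercise asks for.

```r
library(ggplot2)

# Number of distinct (class, drv) pairs versus the number of observations.
n_pairs <- nrow(unique(mpg[, c("class", "drv")]))
n_obs <- nrow(mpg)

# geom_count() draws one point per location and sizes it by the number of
# observations that fall there, so the hidden overlap becomes visible.
p_count <- ggplot(mpg, aes(x = class, y = drv)) +
  geom_count()
```

Printing p_count shows the plot; comparing n_pairs with n_obs shows how many observations share each location.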

Aesthetic Mappings

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = class))

# Top
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

# Bottom
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

Exercises
1. What’s gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

The argument colour = "blue" is included within the mapping argument, and as
such, it is treated as an aesthetic, which is a mapping between a variable and a
value. In the expression colour = "blue", "blue" is interpreted as a categorical
variable which only takes a single value, "blue". If this is confusing, consider how
colour = 1:234 and colour = 1 are interpreted by aes().

To make the points blue, set colour outside aes(), as an argument to geom_point():
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), colour = "blue")
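The difference between mapping and setting can be inspected in the data that ggplot2 computes for the layer. This is a sketch rather than part of the original solution: it uses layer_data() from ggplot2, and the object names are made up for illustration.

```r
library(ggplot2)

# Inside aes(): "blue" is treated as data (a one-level categorical variable),
# so the colour scale picks a colour for it.
p_mapped <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, colour = "blue"))

# Outside aes(): "blue" is a fixed setting applied to every point.
p_set <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), colour = "blue")

# layer_data() returns the values actually used to draw the layer.
mapped_colours <- unique(layer_data(p_mapped)$colour)
set_colours <- unique(layer_data(p_set)$colour)
```

In p_set the colour column holds "blue" itself, while in p_mapped it holds whatever colour the default discrete scale assigns to the single level.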

2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the
documentation for the dataset.) How can you see this information when you run mpg?
The following list contains the categorical variables in mpg:
∙ manufacturer
∙ model
∙ trans
∙ drv
∙ fl
∙ class

The following list contains the continuous variables in mpg:
∙ displ
∙ year
∙ cyl
∙ cty
∙ hwy

3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for
categorical versus continuous variables?

The variable cty, city miles per gallon, is a continuous variable.

ggplot(mpg, aes(x = displ, y = hwy, colour = cty)) +
  geom_point()
Instead of using discrete colors, the continuous variable uses a scale that varies from a light
to dark blue color.
ggplot(mpg, aes(x = displ, y = hwy, size = cty)) +
geom_point()

When mapped to size, the sizes of the points vary continuously as a function of cty.
ggplot(mpg, aes(x = displ, y = hwy, shape = cty)) +
geom_point()
#> Error: A continuous variable can not be mapped to shape
When a continuous value is mapped to shape, it gives an error. Though we could split a
continuous variable into discrete categories and use a shape aesthetic, this would
conceptually not make sense. A numeric variable has an order, but shapes do not. It is clear
that smaller points correspond to smaller values, or once the color scale is given, which
colors correspond to larger or smaller values. But it is not clear whether a square is greater
or less than a circle.
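For completeness, the binning workaround mentioned above could look like the sketch below. It is only an illustration of the mechanics, not a recommendation: the choice of three equal-width bins via base R's cut() and the column name cty_bin are assumptions made for this example.

```r
library(ggplot2)

# Bin cty into three equal-width intervals; the number of bins is arbitrary.
mpg_binned <- as.data.frame(mpg)
mpg_binned$cty_bin <- cut(mpg_binned$cty, breaks = 3)

# The binned variable is a factor, so it can be mapped to shape without error.
p_shape <- ggplot(mpg_binned, aes(x = displ, y = hwy, shape = cty_bin)) +
  geom_point()
```

The plot now renders, but the reader still has to consult the legend to learn which shape stands for the larger values.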

4. What happens if you map the same variable to multiple aesthetics?
ggplot(mpg, aes(x = displ, y = hwy, colour = hwy, size = displ)) +
geom_point()

5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point.)
Stroke changes the size of the border for shapes (21-25). These are filled shapes in which
the color and size of the border can differ from that of the filled interior of the shape.
For example,
ggplot(mtcars, aes(wt, mpg)) +
geom_point(shape = 21, colour = "black", fill = "white", size = 5, stroke = 5)

6. What happens if you map an aesthetic to something other than a variable name, like aes(color = displ
< 5)?

ggplot(mpg, aes(x = displ, y = hwy, colour = displ < 5)) +
  geom_point()
Aesthetics can also be mapped to expressions like displ < 5. The ggplot() function
behaves as if a temporary variable was added to the data with values equal to the result of
the expression. In this case, the result of displ < 5 is a logical variable which takes values
of TRUE or FALSE.
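The "temporary variable" description can be checked by adding the column explicitly and comparing the result. A minimal sketch, assuming ggplot2 is installed; the column name displ_lt_5 and the object names are hypothetical.

```r
library(ggplot2)

# Map the expression directly.
p_expr <- ggplot(mpg, aes(x = displ, y = hwy, colour = displ < 5)) +
  geom_point()

# Store the result of the expression in an explicit column first.
mpg_flag <- as.data.frame(mpg)
mpg_flag$displ_lt_5 <- mpg_flag$displ < 5
p_col <- ggplot(mpg_flag, aes(x = displ, y = hwy, colour = displ_lt_5)) +
  geom_point()

# Compare the colours assigned to each point in the two plots.
same_colours <- identical(layer_data(p_expr)$colour, layer_data(p_col)$colour)
```

The two plots should differ only in the legend title, which is taken from the expression or the column name.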

Facets
To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a
formula, which you create with ~ followed by a variable name (here “formula” is the name of a data
structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be
discrete:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)

To facet your plot on the combination of two variables, add facet_grid() to your plot call. The first
argument of facet_grid() is also a formula. This time the formula should contain two variable names
separated by a ~:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)

Exercises
1. What happens if you facet on a continuous variable?

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(. ~ cty)

The continuous variable is converted to a categorical variable, and the plot contains a facet
for each distinct value.
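A quick way to confirm the one-facet-per-value behaviour is to compare the number of distinct cty values with the number of panels that contain data. This sketch relies on the PANEL column returned by layer_data(); the variable names are illustrative.

```r
library(ggplot2)

p_facet <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(. ~ cty)

# Distinct values of the faceting variable.
n_values <- length(unique(mpg$cty))

# Panels that received at least one point.
n_panels <- length(unique(layer_data(p_facet)$PANEL))
```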

2. What do the empty cells in a plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
ggplot(data = mpg) + geom_point(mapping = aes(x = drv, y = cyl))
ggplot(data = mpg) +
geom_point(mapping = aes(x = hwy, y = cty)) +
facet_grid(drv ~ cyl)

The empty cells (facets) in this plot are combinations of drv and cyl that have no
observations. These are the same locations in the scatter plot of drv and cyl that have no
points.
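The empty combinations can also be listed without plotting. The sketch below cross-tabulates drv and cyl with base R's table(); counts and empty_cells are illustrative names.

```r
library(ggplot2)  # only needed for the mpg data set

# Cross-tabulate drv and cyl; a zero count corresponds to an empty facet.
counts <- table(drv = mpg$drv, cyl = mpg$cyl)

# Row and column positions of the combinations with no observations.
empty_cells <- which(counts == 0, arr.ind = TRUE)
```

Printing counts gives the same grid layout as facet_grid(drv ~ cyl), with drv in rows and cyl in columns.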

3. What plots does the following code make? What does . do?
The symbol . ignores that dimension when faceting. For example, drv ~ . facets by values of
drv on the y-axis.

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
In contrast, . ~ cyl facets by values of cyl on the x-axis.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)

4. Take the first faceted plot in this section:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages?
How might the balance change if you had a larger dataset?
In the following plot the class variable is mapped to color.
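The plot itself did not survive in this copy. A plausible reconstruction, shown next to the faceted version for comparison, is sketched below; it reuses the colour mapping from the Aesthetic Mappings section, and the object names are assumptions.

```r
library(ggplot2)

# class encoded with colour: all points share one panel.
p_colour <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# class encoded with facets: one panel per class.
p_facets <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

# Number of distinct colours a reader has to tell apart in the first plot.
n_colours <- length(unique(layer_data(p_colour)$colour))
```

Printing the two objects one after the other makes it easier to weigh the trade-offs the exercise asks about.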
5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of
the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

6. When using facet_grid() you should usually put the variable with more unique levels in the columns.
Why?

Geometric Objects
A geom is the geometrical object that a plot uses to represent data. People often describe plots by the
type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms,
boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom.

geom_smooth() will draw a different line, with a different linetype, for each unique value of the
variable that you map to linetype:

ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, color = drv),
    show.legend = FALSE
  )

To display multiple geoms in the same plot, add multiple geom functions to ggplot():
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

Mappings placed in a geom function apply to that layer only, which makes it possible to display
different aesthetics in different layers:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()

The local data argument in geom_smooth() overrides the global data argument in ggplot() for that
layer only:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(
    data = filter(mpg, class == "subcompact"),
    se = FALSE
  )

Exercises
1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
● line chart: geom_line()
● boxplot: geom_boxplot()
● histogram: geom_histogram()
● area chart: geom_area()

2. Run this code in your head and predict what the output will look like. Then, run the code in R and
check your predictions:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
This code produces a scatter plot with displ on the x-axis, hwy on the y-axis, and the
points colored by drv. There will be a smooth line, without standard errors, fit through
each drv group.

3. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it
earlier in the chapter?
The layer argument show.legend = FALSE hides the legend box.
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, colour = drv),
show.legend = FALSE
)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
In that plot, there is no legend. Removing the show.legend argument or setting
show.legend = TRUE will result in the plot having a legend displaying the mapping
between colors and drv.

ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, colour = drv))
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The legend is suppressed because with three plots, adding a legend to only the last
plot would make the sizes of plots different. Different sized plots would make it more
difficult to see how arguments change the appearance of the plots. The purpose of
those plots is to show the difference between no groups, using a group aesthetic, and
using a color aesthetic, which creates implicit groups. In that example, the legend
isn’t necessary since looking up the values associated with each color isn’t necessary
to make that point.
4. What does the se argument to geom_smooth() do?

It adds standard error bands to the lines.


ggplot(data = mpg, mapping = aes(x = displ, y = hwy, colour = drv)) +
geom_point() +
geom_smooth(se = TRUE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
By default, se = TRUE.
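The default can be observed by leaving se out of the call. The sketch below assumes that a drawn confidence band shows up as ymin and ymax columns in the computed layer data returned by layer_data(); the object names are illustrative.

```r
library(ggplot2)

# se is not given, so geom_smooth() falls back to its default.
p_default <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth()

# When the band is drawn, the computed data carries its lower and upper limits.
has_band <- all(c("ymin", "ymax") %in% names(layer_data(p_default)))
```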

5. Will these two graphs look different? Why/why not?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

No. Both plots use the same data and mappings for geom_point() and geom_smooth(). In the
first plot the layers inherit them from the ggplot() call, so they do not need to be
specified again; in the second plot they are written out in each layer.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
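The claim that the two graphs are the same can be checked programmatically by comparing the data computed for each layer. This is a sketch using layer_data() with its layer index argument i; the object names are made up for the example.

```r
library(ggplot2)

# Data and mappings given once, globally.
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# Data and mappings repeated in each layer.
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# Compare the computed data layer by layer (1 = points, 2 = smooth).
same_points <- isTRUE(all.equal(layer_data(p_global, 1), layer_data(p_local, 1)))
same_smooth <- isTRUE(all.equal(layer_data(p_global, 2), layer_data(p_local, 2)))
```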
6. Re-create the R code necessary to generate the following graphs

The following code will generate those plots.


ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(group = drv), se = FALSE) +
  geom_point()

ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = drv)) +
  geom_smooth(se = FALSE)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = drv)) +
  geom_smooth(aes(linetype = drv), se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 4, color = "white") +
  geom_point(aes(colour = drv))
