3.DataFrames.GGPlot2
3. R DataFrames
Visualizing Results
Knirsch
Introduction to Programming and Data Science with R
• If R returns an error, you know that you need to correct your code.
• If your code runs without error, it is syntactically correct but may nevertheless be meaningless, nonsensical, misleading or logically wrong.
Code that runs does NOT necessarily produce a meaningful or correct result.
table(cost_cover)
table(sales_categories)
table(sales_rev_after_tax)
sales_rev_after_tax[2,4]
sales_complete[2,4]
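The difference between code that fails and code that runs but says nothing useful can be sketched as follows. The vector and data frame below are small, made-up stand-ins for the course objects sales_rev_after_tax and sales_complete; their values and column layout are assumptions for illustration only.

```r
# Hypothetical stand-ins for the course data (all values are invented)
sales_rev_after_tax <- c(1010.5, 250.0, 730.2, 95.8)
sales_complete <- data.frame(
  day = 1:4,
  sales_rev = c(1200, 300, 870, 115),
  sales_rev_after_tax = sales_rev_after_tax,
  sales_categories = c("Good", "Poor", "Fair", "Very Poor"),
  stringsAsFactors = FALSE
)

# Runs and is meaningful: number of days per category
table(sales_complete$sales_categories)

# Runs without error but is not meaningful: every continuous value
# becomes its own "category" with a count of 1
table(sales_rev_after_tax)

# Fails: a vector has only one dimension, so [row, column] is not valid.
# tryCatch() captures the error message so the script keeps running.
result <- tryCatch(
  sales_rev_after_tax[2, 4],
  error = function(e) conditionMessage(e)
)
print(result)

# Works: a data frame has two dimensions (row 2, column 4)
sales_complete[2, 4]
```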
1 dimension: Vector
2 dimensions: DataFrame, tibble (dplyr)
sales_complete$sales_categories
To read out a row, we need its index (= position). A dataframe is a two-dimensional object
→ there are two index positions, a row index and a column index, written as [row, column], e.g. sales_complete[2,4].
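A minimal sketch of row and column indexing, using a small made-up data frame in place of the course's sales_complete (column names and values are assumptions):

```r
# Hypothetical stand-in for sales_complete (values are invented)
sales_complete <- data.frame(
  day = 1:4,
  sales_rev = c(1200, 300, 870, 115),
  sales_rev_after_tax = c(1010.5, 250.0, 730.2, 95.8),
  sales_categories = c("Good", "Poor", "Fair", "Very Poor"),
  stringsAsFactors = FALSE
)

# [row, column]: leaving one position empty selects everything along it
row_2 <- sales_complete[2, ]    # complete row 2 (all columns)
col_4 <- sales_complete[, 4]    # complete column 4 (all rows), as a vector
cell  <- sales_complete[2, 4]   # single value in row 2, column 4

# Columns can also be addressed by name instead of by position
same_cell <- sales_complete[2, "sales_categories"]

print(row_2)
print(col_4)
print(cell)
```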
We want to read out all (complete) rows with category "Very Poor"
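One way to do this in base R is logical indexing in the row position. The sketch below assumes a small made-up data frame in place of the course's sales_complete:

```r
# Hypothetical stand-in for sales_complete (values are invented)
sales_complete <- data.frame(
  day = 1:5,
  sales_rev = c(1200, 300, 870, 115, 90),
  sales_categories = c("Good", "Poor", "Fair", "Very Poor", "Very Poor"),
  stringsAsFactors = FALSE
)

# The comparison yields one TRUE/FALSE per row ...
is_very_poor <- sales_complete$sales_categories == "Very Poor"

# ... and using it as the row index keeps only the TRUE rows.
# The empty column position means: keep all columns.
very_poor_rows <- sales_complete[is_very_poor, ]
print(very_poor_rows)
```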
dplyr::filter() syntax:
output_df <- filter(input_df, condition)
We want to read out all (complete) rows with category "Very Poor"
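Applied to this task, the filter() syntax could look as follows. The data frame is a made-up stand-in for sales_complete, and the name very_poor_days is chosen here for illustration only:

```r
library(dplyr)

# Hypothetical stand-in for sales_complete (values are invented)
sales_complete <- data.frame(
  day = 1:5,
  sales_rev = c(1200, 300, 870, 115, 90),
  sales_categories = c("Good", "Poor", "Fair", "Very Poor", "Very Poor"),
  stringsAsFactors = FALSE
)

# output_df <- filter(input_df, condition)
# Inside filter(), columns are referred to by their bare name
very_poor_days <- filter(sales_complete, sales_categories == "Very Poor")
print(very_poor_days)
```

Writing dplyr::filter() instead of filter() avoids confusion with stats::filter(), which is the function R finds when dplyr is not loaded.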
Graphics-wise, one can do almost everything with ggplot2. One chart type, however, has no dedicated geom: the pie chart (it can only be built indirectly, as a stacked bar chart in polar coordinates).
The reason is that the pie chart is widely regarded as unprofessional and not recommended, and it is very limited in its possible use: a pie can be deciphered only as long as there is a very limited number of values or groups. It is basically only meaningful for categorical data with very few categories.
Documentation:
ggplot2, CheatSheet (RStudio documentation)
https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-visualization-2.1.pdf
Hadley Wickham, R for Data Science, chapter 3
https://r4ds.had.co.nz/data-visualisation.html
ggplot2
The basic concept of the ggplot2 grammar of graphics is that each diagram consists of 3 mandatory layers: the data (ggplot(data = ...)), the aesthetic mapping (aes()) and a geometry (geom_...()).
ggplot(data= .................................) +
aes(x= .........................) +
geom_bar()
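A filled-in version of the template above might look like this. The data frame is a made-up stand-in for sales_complete; only the column sales_categories is taken from the slides.

```r
library(ggplot2)

# Hypothetical stand-in for sales_complete (values are invented)
sales_complete <- data.frame(
  sales_categories = c("Good", "Poor", "Fair", "Very Poor",
                       "Good", "Good", "Fair", "Very Good"),
  stringsAsFactors = FALSE
)

# Layer 1: data, layer 2: aesthetic mapping, layer 3: geometry.
# geom_bar() counts the rows per category itself, so only x is mapped.
p <- ggplot(data = sales_complete) +
  aes(x = sales_categories) +
  geom_bar()

print(p)
```

Note that each + stands at the end of a line. A line that starts with + is read by R as a new, separate command.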
Which of the following two commands is correct and which is not and why?
Give the chart a title and label the axes by adding the labs() layer; theme_classic() then applies a plain theme without gridlines:
+ labs(title = "Revenue Categories", x = "Revenue Categories", y = "Number of Days")
+ theme_classic()
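Put together with the three mandatory layers, the complete command could be sketched as follows (the data frame is a made-up stand-in for sales_complete):

```r
library(ggplot2)

# Hypothetical stand-in for sales_complete (values are invented)
sales_complete <- data.frame(
  sales_categories = c("Good", "Poor", "Fair", "Very Poor",
                       "Good", "Good", "Fair", "Very Good"),
  stringsAsFactors = FALSE
)

p <- ggplot(data = sales_complete) +
  aes(x = sales_categories) +
  geom_bar() +
  # optional layers: title and axis labels, then a plain theme
  labs(title = "Revenue Categories",
       x = "Revenue Categories",
       y = "Number of Days") +
  theme_classic()

print(p)
```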
Factors
The sales_categories are categorical, more precisely ordinal, data.
Technically, factors are vectors of type integer: R stores integers and assigns the categorical values (levels) to these integers.
factor(x, levels = ....., ordered = TRUE) # creates an ordered factor from the input vector x,
# with the given levels values as categories
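The integer storage can be made visible with a short sketch. The input vector x below is invented for illustration; the levels are the five categories used in this course.

```r
# Invented example values
x <- c("Good", "Very Poor", "Fair", "Good", "Very Good")

x_fac <- factor(
  x,
  levels = c("Very Good", "Good", "Fair", "Poor", "Very Poor"),
  ordered = TRUE
)

typeof(x_fac)       # "integer": the internal storage type
as.integer(x_fac)   # 2 5 3 2 1: position of each value in levels
levels(x_fac)       # the labels that belong to the integers 1 to 5
print(x_fac)
```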
Factors
Did this command work?
Do you get NAs in your new vector?
Attention: A factor only accepts values that are predefined in the levels = .... argument. Any other value in the vector is converted to NA.
The intention of a factor is to work only with the predefined values.
→ When defining a factor, make sure there are no typos or other syntax errors.
sales_complete$sales_categories_fac <- factor(sales_complete$sales_categories,
    levels = c("Very Good", "Good", "Fair", "Poor", "Very Poor"), ordered = TRUE)
→ Check that each levels value is spelled exactly as in the existing vector.
→ Check that you list all predefined values.
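How a typo turns into NA, and one way to find the offending values, can be sketched as follows. The input vector is invented and contains two deliberately misspelled entries.

```r
# Invented example: "Very poor" and "good" do not match any level
# exactly (R is case-sensitive)
x <- c("Good", "Very poor", "Fair", "good")
lev <- c("Very Good", "Good", "Fair", "Poor", "Very Poor")

x_fac <- factor(x, levels = lev, ordered = TRUE)
print(x_fac)

# Count the NAs created by the conversion
n_na <- sum(is.na(x_fac))        # 2

# List the values in x that are missing from the levels
not_in_levels <- setdiff(unique(x), lev)
print(not_in_levels)             # "Very poor" "good"
```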
Factors
Once your factor vector is completely correct, you can overwrite the original vector and remove the additional
factor-vector:
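A possible sketch of these two steps, not necessarily the command shown in the lecture. The data frame is a made-up stand-in for sales_complete, and the NA check before overwriting is an added precaution.

```r
# Hypothetical stand-in for sales_complete (values are invented)
sales_complete <- data.frame(
  day = 1:4,
  sales_categories = c("Good", "Poor", "Fair", "Very Poor"),
  stringsAsFactors = FALSE
)

sales_complete$sales_categories_fac <- factor(
  sales_complete$sales_categories,
  levels = c("Very Good", "Good", "Fair", "Poor", "Very Poor"),
  ordered = TRUE
)

# Only overwrite if the conversion produced no NAs
stopifnot(!anyNA(sales_complete$sales_categories_fac))

# Overwrite the original column with the factor ...
sales_complete$sales_categories <- sales_complete$sales_categories_fac
# ... and remove the additional column by assigning NULL
sales_complete$sales_categories_fac <- NULL

str(sales_complete)
```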
Factors
• Command to convert vector to a factor
Factors
- By default, the levels are assigned to the integer vector in alphabetical or ascending order.
- levels = .........., ordered = TRUE enforces a different order / ranking on the factor values.
$levels
[1] "Very Good" "Good" "Fair" "Poor" "Very Poor"
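The difference between the default order and an enforced order can be sketched with an invented input vector:

```r
# Invented example values
x <- c("Good", "Very Poor", "Fair", "Poor", "Very Good")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(x))
print(default_levels)   # "Fair" "Good" "Poor" "Very Good" "Very Poor"

# Enforced order: levels are ranked as listed
x_ord <- factor(
  x,
  levels = c("Very Good", "Good", "Fair", "Poor", "Very Poor"),
  ordered = TRUE
)
attributes(x_ord)       # shows $levels and $class

# With ordered = TRUE, comparisons follow the level order:
# "Good" (position 2) comes before "Very Poor" (position 5)
good_before_very_poor <- x_ord[1] < x_ord[2]
print(good_before_very_poor)   # TRUE
```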
1 dimension: Vector, Factor
2 dimensions: DataFrame, tibble (dplyr)
Which chart types make sense for continuous (numerical) data like the daily revenue data?
Histogram
ggplot(data =..........................) +
aes(x =....................................) +
geom_histogram(bins = ................, color = "blue", fill = "lightblue", alpha = 0.5) +
labs(title = "Daily Revenue", x = "Revenue", y = "Frequency")
• Histograms look similar to bar charts: the x-axis shows the values and the y-axis shows their frequency as bars.
• How do we get to the bars? The continuous values are classified into groups (bins) of equal width.
• The number of bins or their width should always be specified when creating a histogram (argument bins = ....... or binwidth = .......); otherwise ggplot2 falls back to 30 bins and prints a message asking for a better value.
• The bars are displayed directly next to each other with no space between them, as a reminder that there are actually no boundaries between the values.
• Individual values cannot be read from a histogram.
• However, the shape, i.e. symmetry versus skewness, and the modality (the number of peaks) can be seen well.
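A filled-in version of the histogram template could look like the sketch below. The revenue values are simulated (right-skewed random numbers), and the names revenue_df and sales_rev as well as bins = 20 are assumptions chosen for illustration.

```r
library(ggplot2)

# Simulated daily revenue; set.seed() makes the random numbers reproducible
set.seed(1)
revenue_df <- data.frame(
  sales_rev = round(rlnorm(200, meanlog = 6.5, sdlog = 0.4))
)

p <- ggplot(data = revenue_df) +
  aes(x = sales_rev) +
  # bins sets the number of equal-width groups
  geom_histogram(bins = 20, color = "blue", fill = "lightblue", alpha = 0.5) +
  labs(title = "Daily Revenue", x = "Revenue", y = "Frequency")

print(p)
```

Re-running the plot with a different bins value is a quick way to see how strongly the visible shape depends on this choice.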
Summary
data(faithful)
faithful
The R dataset “faithful” contains a list of 272 observations of geyser eruptions during October 1980.
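A short sketch for a first look at the dataset. faithful ships with base R and has two columns, eruptions (eruption duration) and waiting (waiting time until the next eruption); plotting waiting with bins = 20 is a choice made here for illustration.

```r
library(ggplot2)

data(faithful)

dim(faithful)     # 272 rows, 2 columns
names(faithful)   # "eruptions" "waiting"
head(faithful)    # first six observations

# Histogram of the waiting times
p <- ggplot(data = faithful) +
  aes(x = waiting) +
  geom_histogram(bins = 20, color = "blue", fill = "lightblue", alpha = 0.5) +
  labs(title = "Old Faithful", x = "Waiting time", y = "Frequency")

print(p)
```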