R Course Own English HS
BMS
Program today
• Introduction R / RStudio
• Tables in R (with dplyr)
• Make table objects
• Select rows/columns
• Sort rows
• Add new columns
• Aggregate data
• Make graphs with ggplot2
• Simple regression models
R for Data Science
R for Data Science book
RStudio
R and R packages
Installing and using packages
Installing packages
Tidyverse
• Is not standard R!
• Webpage: http://www.tidyverse.org
install.packages("tidyverse")
library(tidyverse)
Base R
Variables / Vectors: int, num
a <- 1L
class(a)
## [1] "integer"
vec1 <- c(1L,3L,5L,7L)
class(vec1)
## [1] "integer"
vec2 <- c(10.5,3.2,pi,4)
class(vec2)
## [1] "numeric"
vec1 + vec2
Variables / Vectors: chr, lgl, date
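The three remaining vector types can be sketched in the same way as the integer and numeric vectors. The names `vec3`, `vec4` and `vec5` are assumptions chosen to continue the earlier numbering; the values are those shown in the example table of this course.

```r
# Character vector (chr)
vec3 <- c("low", "low", "high", "medium")
class(vec3)
## [1] "character"

# Logical vector (lgl)
vec4 <- c(TRUE, FALSE, TRUE, FALSE)
class(vec4)
## [1] "logical"

# Date vector (date), converted from text in year-month-day format
vec5 <- as.Date(c("2000-09-14", "2002-07-03", "2004-04-14", "2004-06-10"))
class(vec5)
## [1] "Date"
```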
Tables: dataframe or tibble
## # A tibble: 4 x 5
## col1 col2 col3 col4 col5
## <int> <dbl> <chr> <lgl> <date>
## 1 1 10.5 low TRUE 2000-09-14
## 2 3 3.2 low FALSE 2002-07-03
## 3 5 3.14 high TRUE 2004-04-14
## 4 7 4 medium FALSE 2004-06-10
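A table like the one above could be built as follows. This is a sketch, assuming the `tibble` package (part of the tidyverse) is installed; the object name `tbl` is hypothetical.

```r
library(tibble)

# Each argument becomes one column; all columns need the same length
tbl <- tibble(
  col1 = c(1L, 3L, 5L, 7L),
  col2 = c(10.5, 3.2, pi, 4),
  col3 = c("low", "low", "high", "medium"),
  col4 = c(TRUE, FALSE, TRUE, FALSE),
  col5 = as.Date(c("2000-09-14", "2002-07-03", "2004-04-14", "2004-06-10"))
)
tbl

# A single column can be extracted as a vector with $
tbl$col2
mean(tbl$col2)
```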
Work with columns and vectors
Functions
## [1] 15
mean(c(10,14,2000))
## [1] 674.6667
mean(c(10,14,NA), na.rm = TRUE)
## [1] 12
Objects
## [1] 200 220 240 260 280 300 320 340 360 380 400
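One way to obtain a vector like this is to store the result of `seq()` in an object. This is only a sketch: the object name `x` and the use of `seq()` are assumptions, not necessarily what was used to produce the output above.

```r
# Store a sequence from 200 to 400 in steps of 20 in the object x
x <- seq(from = 200, to = 400, by = 20)
x
## [1] 200 220 240 260 280 300 320 340 360 380 400

# The object can be reused in later calculations
length(x)
## [1] 11
```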
Naming of objects
R script
RStudio Project
Exercise session 1
Tables in R
Tables
Example: Gapminder table
library(gapminder)
gapminder
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
Base R versus tidyverse
# Base R
asia <- gapminder[gapminder$continent == "Asia",]
mean(asia$lifeExp)
## [1] 60.0649
library(dplyr) # Tidyverse
gapminder %>%
  filter(continent == "Asia") %>%
  summarise(mean_exp = mean(lifeExp))
## # A tibble: 1 x 1
## mean_exp
## <dbl>
## 1 60.1
The pipe operator
• filter
• summarise
• group_by (and ungroup)
• mutate
• arrange
• rename
• select
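The pipe operator `%>%` passes the result on its left as the first argument of the function on its right, so `x %>% f(y)` is the same as `f(x, y)`. A small sketch with base R functions (the object names are hypothetical):

```r
library(dplyr)  # makes the pipe operator %>% available

values <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- sum(sqrt(values))

# The same computation with the pipe is read from left to right
piped <- values %>% sqrt() %>% sum()

identical(nested, piped)
## [1] TRUE
```

The verbs listed above all take a table as their first argument, which is why they chain naturally with the pipe.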
filter()
## # A tibble: 4 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Albania Europe 2007 76.4 3600523 5937.
## 3 Algeria Africa 2007 72.3 33333216 6223.
## 4 Angola Africa 2007 42.7 12420476 4797.
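The rows shown are consistent with keeping only the year 2007 and printing the first four rows. A sketch under that assumption (the name `rows_2007` is hypothetical):

```r
library(dplyr)
library(gapminder)

# Keep only the rows for which the condition is TRUE
rows_2007 <- gapminder %>%
  filter(year == 2007)

head(rows_2007, n = 4)
```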
filter(): more filters simultaneously (OR operator)
## # A tibble: 8 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Djibouti Africa 2007 54.8 496374 2082.
## 2 Bahrain Asia 1977 65.6 297410 19340.
## 3 Thailand Asia 1997 67.5 60216677 5853.
## 4 Namibia Africa 2007 52.9 2055080 4811.
## 5 Singapore Asia 1982 71.8 2651869 15169.
## 6 Korea, Dem. Rep. Asia 2007 67.3 23301725 1593.
## 7 Nepal Asia 1997 59.4 23001113 1011.
## 8 Reunion Africa 2007 76.4 798094 7670.
filter(): more filters simultaneously (AND operator)
## # A tibble: 33 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2002 42.1 25268405 727.
## 2 Bahrain Asia 2002 74.8 656397 23404.
## 3 Bangladesh Asia 2002 62.0 135656790 1136.
## 4 Cambodia Asia 2002 56.8 12926707 896.
## 5 China Asia 2002 72.0 1280400000 3119.
## 6 Hong Kong, China Asia 2002 81.5 6762476 30209.
## 7 India Asia 2002 62.9 1034172547 1747.
## 8 Indonesia Asia 2002 68.6 211060000 2874.
## 9 Iran Asia 2002 69.5 66907826 9241.
## 10 Iraq Asia 2002 57.0 24001816 4391.
## # ... with 23 more rows
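Conditions separated by a comma (or by `&`) must all hold, whereas `|` keeps rows for which at least one condition holds. The sketch below assumes the output above comes from combining `continent == "Asia"` with `year == 2002`; the object names are hypothetical.

```r
library(dplyr)
library(gapminder)

# AND: both conditions must be TRUE
asia_2002 <- gapminder %>%
  filter(continent == "Asia", year == 2002)

# OR: at least one of the conditions must be TRUE
asia_or_2002 <- gapminder %>%
  filter(continent == "Asia" | year == 2002)

nrow(asia_2002)
nrow(asia_or_2002)
```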
filter(): use of %in%
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Belgium Europe 1987 75.4 9870200 22526.
## 2 Belgium Europe 1992 76.5 10045622 25576.
## 3 Belgium Europe 1997 77.5 10199787 27561.
## 4 Netherlands Europe 1987 76.8 14665278 23651.
## 5 Netherlands Europe 1992 77.4 15174244 26791.
## 6 Netherlands Europe 1997 78.0 15604464 30246.
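`%in%` tests whether each value occurs in a given set, which is shorter than chaining several `==` conditions with `|`. A sketch that would give the six rows above, assuming both the countries and the years were selected this way:

```r
library(dplyr)
library(gapminder)

benelux <- gapminder %>%
  filter(country %in% c("Belgium", "Netherlands"),
         year %in% c(1987, 1992, 1997))

benelux
```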
summarise()
## # A tibble: 1 x 3
## max_exp mean_exp sd_exp
## <dbl> <dbl> <dbl>
## 1 82.6 67.0 12.1
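`summarise()` collapses a table to one row of summary statistics. The column names below are taken from the output above; restricting the data to 2007 is an assumption.

```r
library(dplyr)
library(gapminder)

life_summary <- gapminder %>%
  filter(year == 2007) %>%
  summarise(max_exp  = max(lifeExp),
            mean_exp = mean(lifeExp),
            sd_exp   = sd(lifeExp))

life_summary
```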
combination of group_by() and summarise()
## # A tibble: 5 x 5
## continent max_exp mean_exp sd_exp aantal
## <fct> <dbl> <dbl> <dbl> <int>
## 1 Africa 76.4 54.8 9.63 52
## 2 Americas 80.7 73.6 4.44 25
## 3 Asia 82.6 70.7 7.96 33
## 4 Europe 81.8 77.6 2.98 30
## 5 Oceania 81.2 80.7 0.729 2
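With `group_by()` in front, `summarise()` returns one row per group. A sketch using the column names from the output above (`aantal` is Dutch for "count"); the restriction to 2007 is an assumption.

```r
library(dplyr)
library(gapminder)

by_continent <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(max_exp  = max(lifeExp),
            mean_exp = mean(lifeExp),
            sd_exp   = sd(lifeExp),
            aantal   = n())   # n() counts the rows in each group

by_continent
```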
mutate()
## # A tibble: 4 x 7
## country continent year lifeExp pop gdpPercap just_one
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 1
## 2 Afghanistan Asia 1957 30.3 9240934 821. 1
## 3 Afghanistan Asia 1962 32.0 10267083 853. 1
## 4 Afghanistan Asia 1967 34.0 11537966 836. 1
mutate()
## # A tibble: 4 x 7
## country continent year lifeExp pop gdpPercap gdp
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
## 2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
## 3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.
mutate()
## # A tibble: 4 x 7
## country continent year lifeExp pop gdpPercap gdp
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6567.
## 2 Afghanistan Asia 1957 30.3 9240934 821. 7585.
## 3 Afghanistan Asia 1962 32.0 10267083 853. 8759.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 9648.
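`mutate()` adds a column computed from existing columns. The `gdp` values in the two tables above are consistent with multiplying `pop` by `gdpPercap`, and then expressing the result in millions; a sketch under that assumption (object names are hypothetical):

```r
library(dplyr)
library(gapminder)

# Total GDP per row
gdp_total <- gapminder %>%
  mutate(gdp = pop * gdpPercap)

# The same quantity expressed in millions
gdp_millions <- gapminder %>%
  mutate(gdp = pop * gdpPercap / 1e6)

head(gdp_millions, n = 4)
```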
combination of mutate() and group_by
## # A tibble: 142 x 7
## # Groups: continent [5]
## country continent year lifeExp pop gdpPercap rank
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Australia Oceania 2007 81.2 20434176 34435. 1
## 2 Canada Americas 2007 80.7 33390141 36319. 1
## 3 Iceland Europe 2007 81.8 301931 36181. 1
## 4 Japan Asia 2007 82.6 127467972 31656. 1
## 5 Reunion Africa 2007 76.4 798094 7670. 1
## 6 Costa Rica Americas 2007 78.8 4133884 9645. 2
## 7 Hong Kong, China Asia 2007 82.2 6980412 39725. 2
## 8 Libya Africa 2007 74.0 6036914 12057. 2
## 9 New Zealand Oceania 2007 80.2 4115771 25185. 2
## 10 Switzerland Europe 2007 81.7 7554661 37506. 2
## # ... with 132 more rows
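After `group_by()`, `mutate()` computes its result within each group. The sketch below ranks countries by life expectancy within their continent; the use of base R `rank()` with `desc()` and the restriction to 2007 are assumptions that fit the output above.

```r
library(dplyr)
library(gapminder)

ranked <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  mutate(rank = rank(desc(lifeExp))) %>%  # 1 = highest life expectancy in the continent
  arrange(rank)

ranked
```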
arrange()
## # A tibble: 5 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Albania Europe 1952 55.2 1282697 1601.
## 3 Algeria Africa 1952 43.1 9279525 2449.
## 4 Angola Africa 1952 30.0 4232095 3521.
## 5 Argentina Americas 1952 62.5 17876956 5911.
arrange() : descending order
## # A tibble: 5 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Japan Asia 2007 82.6 127467972 31656.
## 2 Hong Kong, China Asia 2007 82.2 6980412 39725.
## 3 Iceland Europe 2007 81.8 301931 36181.
## 4 Switzerland Europe 2007 81.7 7554661 37506.
## 5 Australia Oceania 2007 81.2 20434176 34435.
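`arrange()` sorts rows in ascending order by default; wrapping a column in `desc()` reverses the order. A sketch, assuming the first table was sorted by `year` and the second shows the 2007 rows sorted by decreasing `lifeExp`:

```r
library(dplyr)
library(gapminder)

# Ascending order by year
by_year <- gapminder %>%
  arrange(year)
head(by_year, n = 5)

# Descending order by life expectancy, for 2007 only
top_life <- gapminder %>%
  filter(year == 2007) %>%
  arrange(desc(lifeExp))
head(top_life, n = 5)
```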
group_by and ungroup()
## # A tibble: 1 x 6
## # Groups: continent, country [1]
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
temp <- temp %>% ungroup()
head(temp, n = 1)
## # A tibble: 1 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
select(): select columns
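A minimal sketch of `select()`; the choice of columns and the object names are only for illustration.

```r
library(dplyr)
library(gapminder)

# Keep only the named columns
small <- gapminder %>%
  select(country, year, lifeExp)

# A minus sign drops columns instead
no_gdp <- gapminder %>%
  select(-gdpPercap)

names(small)
names(no_gdp)
```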
rename(): change column name
head(gapminder_NL, n = 2)
## # A tibble: 2 x 6
## land continent jaar levensverwachting pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
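`rename()` uses the form `new_name = old_name`. The Dutch column names in the output above could be obtained as in this sketch (the exact call used for `gapminder_NL` is an assumption):

```r
library(dplyr)
library(gapminder)

gapminder_NL <- gapminder %>%
  rename(land = country,
         jaar = year,
         levensverwachting = lifeExp)

head(gapminder_NL, n = 2)
```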
Some useful functions for tables
Difference with and without pipe
Import datasets
• package readr
• via the menu (Import Dataset button in the Environment pane)
library(readr)
lbw <- read_csv("data/lbw_data.csv")
Make categorical variables
• package forcats
library(dplyr)   # for %>% and mutate()
library(forcats)
lbw <- lbw %>%
  mutate(low = factor(low),
         low = fct_recode(low, "No" = "0", "Yes" = "1"),
         race = factor(race),
         race = fct_recode(race, "White" = "1", "Black" = "2",
                           "Other" = "3"),
         smoke = factor(smoke),
         smoke = fct_recode(smoke, "No" = "0", "Yes" = "1"),
         ht = factor(ht),
         ht = fct_recode(ht, "No" = "0", "Yes" = "1"),
         ui = factor(ui),
         ui = fct_recode(ui, "No" = "0", "Yes" = "1"))
Exercise session 2
Make graphs with ggplot2
Package ggplot2
• Not standard R
• Based on Grammar of Graphics
• Graph = Data + Layout + Coordinate system
• Graph can have more layers
• A layer has aesthetic (aes) properties coupled with properties of data
• Handy cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
Scatterplot example code
library(ggplot2)
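A minimal sketch of a scatterplot of birth weight against the mother's age. Because the course file `data/lbw_data.csv` is not available here, the sketch uses the `birthwt` data from the MASS package as a stand-in for `lbw`; that substitution and the column name `bwt` for birth weight are assumptions.

```r
library(ggplot2)

# Stand-in for the course data (assumption): birthwt from the MASS package
lbw <- MASS::birthwt

# One layer of points; aes() couples plot properties to columns of the data
p <- ggplot(data = lbw) +
  geom_point(aes(x = age, y = bwt)) +
  labs(title = "Birth weight and age mother",
       x = "Age mother",
       y = "Birth weight")
p
```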
Scatterplot example
[Figure: scatterplot titled "Birth weight and age mother"; y-axis: Birth weight (ticks 0, 2000, 4000)]
Scatterplot example code: split according to smoking
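Mapping a categorical column to the `colour` aesthetic splits the points by group and adds a legend. As a sketch, `birthwt` from the MASS package again stands in for `lbw`; the column name `bwt` and the choice of `colour` (rather than, say, `shape`) are assumptions.

```r
library(ggplot2)

# Stand-in for the course data (assumption): birthwt from the MASS package
lbw <- MASS::birthwt
lbw$smoke <- factor(lbw$smoke, levels = c(0, 1), labels = c("No", "Yes"))

# colour inside aes() gives each smoking group its own colour
p_smoke <- ggplot(data = lbw) +
  geom_point(aes(x = age, y = bwt, colour = smoke)) +
  labs(x = "Age mother",
       y = "Birth weight",
       colour = "Smoking")
p_smoke
```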
Scatterplot split according to smoking
[Figure: scatterplot with y-axis Birth weight (ticks 1000 to 5000); points distinguished by Smoking (No, Yes)]
Histogram
library(ggplot2)
ggplot(data = lbw) +
  # after_stat(count) replaces the deprecated ..count.. notation
  geom_histogram(aes(x = age, y = after_stat(count), fill = race)) +
  labs(x = "Age mother",
       y = "Number",
       fill = "Race")
Histogram
[Figure: histogram with y-axis Number (ticks 0, 10, 20); bars filled by Race (White, Black, Other)]
Boxplot
library(ggplot2)
ggplot(data = lbw) +
  geom_boxplot(aes(x = race, y = age, fill = smoke)) +
  labs(x = "Race mother",
       y = "Age mother",
       fill = "Smoking")
Boxplot
[Figure: boxplots with y-axis Age mother (ticks 20, 30, 40); boxes filled by Smoking (No, Yes)]
Subplots
library(ggplot2)
ggplot(data = lbw) +
  # after_stat(count) replaces the deprecated ..count.. notation
  geom_histogram(aes(x = age, y = after_stat(count))) +
  labs(x = "Age mother",
       y = "Number") +
  facet_wrap( ~ race, nrow = 1)
Histogram split by race
[Figure: three histogram panels (White, Black, Other); x-axis: Age mother (ticks 20, 30, 40); y-axis: Number]
Exercise session 3
Simple models
Example: relation between FEV and age
[Figure: scatterplot of fev (y-axis, ticks 1 to 6) against age (x-axis, ticks 5, 10, 15)]
Linear regression
• function lm()
• model formula: response ~ x1 + x2 + ...
• the intercept is estimated automatically
• rows with missing values (NAs) are dropped
library(modelr)
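A minimal sketch of fitting and inspecting a linear model with `lm()`. The course data `fevDat` is not available here, so the sketch uses `birthwt` from the MASS package as a stand-in and models birth weight (`bwt`) on the mother's age; both choices are assumptions made only for illustration.

```r
# Stand-in data (assumption): birthwt from the MASS package
lbw <- MASS::birthwt

# response ~ predictor; the intercept is included automatically
model <- lm(bwt ~ age, data = lbw)

# Coefficient table, residual summary and R-squared
summary(model)

# Only the estimated coefficients
coef(model)
```

For the course example the call would have the same shape, with `fev ~ age` as formula and `fevDat` as data.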
Output lm
##
## Call:
## lm(formula = fev ~ age, data = fevDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.57539 -0.34567 -0.04989 0.32124 2.12786
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.431648 0.077895 5.541 4.36e-08 ***
## age 0.222041 0.007518 29.533 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5675 on 652 degrees of freedom
## Multiple R-squared: 0.5722, Adjusted R-squared: 0.5716
## F-statistic: 872.2 on 1 and 652 DF, p-value: < 2.2e-16
Predictions
## # A tibble: 5 x 2
## age pred
## <dbl> <dbl>
## 1 25 5.98
## 2 35 8.20
## 3 45 10.4
## 4 55 12.6
## 5 65 14.9
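Predictions for new values can be added to a table with `add_predictions()` from modelr, which puts them in a column named `pred` by default; `add_residuals()` adds a column `resid` in the same way. The sketch below again uses `birthwt` from MASS as stand-in data, and the ages in `new_data` are arbitrary illustrative values.

```r
library(modelr)

# Stand-in data and model (assumption): birthwt from the MASS package
lbw <- MASS::birthwt
model <- lm(bwt ~ age, data = lbw)

# Table with the predictor values for which predictions are wanted
new_data <- data.frame(age = seq(from = 20, to = 40, by = 10))
predictions <- add_predictions(new_data, model)
predictions

# Predictions and residuals for the original rows
lbw_fit <- add_residuals(add_predictions(lbw, model), model)
head(lbw_fit[, c("age", "bwt", "pred", "resid")])
```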
Residuals
## histogram residuals
ggplot(fevDat) + geom_histogram(aes(x = resid))
## residualplot
ggplot(fevDat) + geom_point(aes(x = pred, y = resid))
Logistic regression
summary(model)$coefficients # estimated coefficients (betas)
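A logistic regression can be fitted with `glm()` and `family = binomial`. This is a sketch only: `birthwt` from the MASS package stands in for the course data, and the choice of `low` (low birth weight, 0/1) as response with `age` and `smoke` as predictors is an assumption.

```r
# Stand-in data (assumption): birthwt from the MASS package
lbw <- MASS::birthwt

# Binary response modelled on the log-odds scale
model <- glm(low ~ age + smoke, data = lbw, family = binomial)

# Estimated coefficients (log-odds) with standard errors and p-values
summary(model)$coefficients

# Exponentiated coefficients are odds ratios
exp(coef(model))
```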
Exercise session 4
Information on the web
• www.stackoverflow.com
• www.r-project.org
• www.rweekly.org
• www.r-bloggers.com
• cheatsheets (see also Help > Cheatsheets)
End