
R Course Own English HS

R and RStudio are tools for data analysis and visualization. R is the programming language and RStudio is an integrated development environment (IDE) that provides a user-friendly interface for working in R. The tidyverse is a collection of R packages that provide a consistent set of functions for data manipulation, visualization, and analysis using verbs like filter(), summarise(), and group_by(). Tables in R like dataframes and tibbles store data in columns that can be manipulated using these verbs and the pipe operator (%>%) to chain commands together.

Uploaded by

Pedro Henrique

Introduction to R / RStudio

BMS

Karin Groothuis-Oudshoorn & Robert Marinescu-Muster


August 27th

Program today

• Introduction R / RStudio
• Tables in R (with dplyr)
• Make table objects
• Select rows/columns
• Sort rows
• Add new columns
• Aggregate data
• Make graphs with ggplot2
• Simple regression models

R for Data Science

R for Data Science book

Book online: http://r4ds.had.co.nz/

R and RStudio

• R: the engine
• RStudio: the dashboard

RStudio

R and R packages

• R: a new phone
• R packages: apps that you can download

Installing and using packages

1. Install a package: only once
2. Load a package: every session
3. Reinstall packages: if you update R
4. List the packages that are loaded by default:
## which packages are loaded by default?
search()

## [1] ".GlobalEnv" "package:stats" "package:graphics"
## [4] "package:grDevices" "package:utils" "package:datasets"
## [7] "package:methods" "Autoloads" "package:base"

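The install-once, load-every-session pattern can be sketched as a small helper. The function name `ensure_package` is hypothetical (it is not part of R or RStudio); it calls `install.packages()` only when the package is missing and then loads it with `library()`.

```r
# Hypothetical helper (not part of R itself): install a package only if it
# is missing, then load it for the current session.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)              # needed only once per R installation
  }
  library(pkg, character.only = TRUE)  # needed in every new session
}

# "stats" ships with R, so nothing is downloaded here
ensure_package("stats")
loaded <- "package:stats" %in% search()
loaded
```

Calling `search()` afterwards shows the package in the list of attached packages.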
Installing packages

The easy way, in the lower right panel of RStudio:
a) Click on the 'Packages' tab
b) Click on 'Install'
c) Type the name of the package under 'Packages'
d) Click on 'Install'

Tidyverse

• Developer: Hadley Wickham (of RStudio)
• Collection of packages: dplyr, ggplot2, tibble, readr, tidyr, purrr, stringr, forcats
• More consistent than standard R
• A good starting point to learn R
• Is not standard R!
• Webpage: http://www.tidyverse.org
install.packages("tidyverse") # only once; the package name must be quoted
library(tidyverse)            # every session
Base R
Variables / Vectors: int, num

a <- 1L
class(a)

## [1] "integer"
vec1 <- c(1L,3L,5L,7L)
class(vec1)

## [1] "integer"
vec2 <- c(10.5,3.2,pi,4)
class(vec2)

## [1] "numeric"
vec1 + vec2

## [1] 11.500000 6.200000 8.141593 11.000000
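
The addition above works element by element, and the integer vector is converted to numeric before adding. A minimal sketch that checks both points, using the same vectors as the slide:

```r
vec1 <- c(1L, 3L, 5L, 7L)    # integer vector
vec2 <- c(10.5, 3.2, pi, 4)  # numeric (double) vector

# element 1 + element 1, element 2 + element 2, and so on
total <- vec1 + vec2

class(total)   # integers are converted to doubles, so the result is numeric
length(total)  # same length as the two inputs
```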

Variables / Vectors: chr, lgl, date

• vec3: character vector ("character")
• vec4: logical vector ("logical")
• vec5: date vector ("Date")
vec3 <- c("low","low","high","medium")
vec4 <- c(TRUE,FALSE,TRUE,FALSE)
library(lubridate)
vec5 <- ymd(c("2000-9-14","2002-7-3",
              "2004-4-14","2004-6-10"))

Tables: dataframe or tibble

• Columns of a dataframe / tibble are vectors
• All columns/vectors have equal length, but each column may hold a different type of data
library(tibble)
table <- tibble(
  col1 = vec1,
  col2 = vec2,
  col3 = vec3,
  col4 = vec4,
  col5 = vec5
)
table

## # A tibble: 4 x 5
## col1 col2 col3 col4 col5
## <int> <dbl> <chr> <lgl> <date>
## 1 1 10.5 low TRUE 2000-09-14
## 2 3 3.2 low FALSE 2002-07-03
## 3 5 3.14 high TRUE 2004-04-14
## 4 7 4 medium FALSE 2004-06-10

Work with columns and vectors

• Use $ to work with columns
• Select elements or parts of a vector / table with the brackets [ and ]
table$col1                  # select the column col1 from the table
table[,1]                   # select the first column from the table
table[1,]                   # select the first row from the table
table[1:2,3:4]              # select the first 2 rows, 3rd and 4th columns
table[table$col3 == "low",] # select all rows with col3 == "low"

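The same selections can be tried on a small toy table. This is a minimal sketch that uses a base R `data.frame` (an assumption made so it runs without extra packages); the table `toy` and its values are made up for illustration.

```r
# Made-up toy table, only for illustration
toy <- data.frame(
  col1 = c(1L, 3L, 5L, 7L),
  col3 = c("low", "low", "high", "medium")
)

first_col <- toy$col1                   # one column, as a vector
low_rows  <- toy[toy$col3 == "low", ]   # rows where the condition is TRUE
corner    <- toy[1:2, 1:2]              # rows 1 to 2, columns 1 to 2

nrow(low_rows)
dim(corner)
```

Note that `toy[, 1]` returns a plain vector for a data.frame, whereas the same call on a tibble returns a one-column tibble.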
Functions

• a function has a name and arguments
• the arguments go between parentheses ()
• use no space between the function name and the opening parenthesis
• use the help() function for more information about a function
sum(c(3,5,7))

## [1] 15
mean(c(10,14,2000))

## [1] 674.6667
mean(c(10,14,NA), na.rm = TRUE)

## [1] 12

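The `na.rm` argument matters because a single missing value makes the default result missing as well. A short sketch with the values from the slide:

```r
values <- c(10, 14, NA)

with_na    <- mean(values)                # NA: the missing value propagates
without_na <- mean(values, na.rm = TRUE)  # NA is removed before averaging

with_na
without_na
```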
Objects

• an object contains the result of an assignment
• an object can be used in a new assignment
• you make an object with <- (or with =)
• objects are saved in the global environment (shown in the Environment tab, by default the upper right panel)
• objects in the global environment are deleted when you quit R/RStudio (unless you save them)
a_row <- seq(from = 100, to = 200, by = 10)
a_row2 <- 2*a_row
a_row2

## [1] 200 220 240 260 280 300 320 340 360 380 400

Naming of objects

• the name of an object starts with a letter
• the name of an object contains only letters, numbers, _ and .
• lowercase and capital letters are NOT the same
• don't use spaces in names

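Two of these naming rules can be checked directly. A minimal sketch; the object names are made up, and `make.names()` is the base R function that converts an arbitrary string into a syntactically valid name.

```r
my_value <- 1
My_value <- 2   # a different object: names are case-sensitive

same <- identical(my_value, My_value)

# A space is not allowed in a name; make.names() replaces it with a dot
fixed <- make.names("my value")

same
fixed
```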
R script

• Document with lines of assignments / code
• Extension of the file is .R
• Comments can be written in it, starting with a hash sign (#)
• Run the code with the Run button (at the top, in the middle of the RStudio window)
• Output will be shown in the Console
• Improves reproducibility

RStudio Project

• A directory on the hard disk
• Put scripts and data in a project directory
• A project directory is a working directory for R
• RStudio places some standard files in a project directory
• You can make a project directory with File > New Project
• You can open an existing project directory with File > Open Project
• Information about a project is saved in the file <your_project_name>.Rproj

Exercise session 1
Tables in R
Tables

• A table object has a name
• A table has rows and columns
• All data in a column should have the same type
• Columns are all of equal length
• Columns have a name
• Rows sometimes have a name (possible for dataframes, not for tibbles)

Example: Gapminder table

library(gapminder)
gapminder

## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows

Base R versus tidyverse

# Base R
asia <- gapminder[gapminder$continent == "Asia",]
mean(asia$lifeExp)

## [1] 60.0649
library(dplyr) # Tidyverse
gapminder %>%
filter(continent == "Asia") %>%
summarise(mean_exp = mean(lifeExp))

## # A tibble: 1 x 1
## mean_exp
## <dbl>
## 1 60.1
The pipe operator

• A way of chaining commands one after another
• You can read it as "and then"
• Part of the package magrittr (but it is automatically loaded with dplyr)
gapminder %>%
  filter(continent == "Asia") %>%
  summarise(mean_exp = mean(lifeExp))
# without pipe
temp <- filter(gapminder, continent == "Asia")
temp <- summarise(temp, mean_exp = mean(lifeExp))
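The pipe passes the result on its left as the first argument of the function on its right, so `x %>% f(y)` gives the same result as `f(x, y)`. A minimal sketch on a made-up toy table (the names `scores`, `piped` and `nested` are only for illustration), assuming dplyr is installed:

```r
library(dplyr)

# Made-up toy table, only for illustration
scores <- tibble(group = c("a", "a", "b"), value = c(1, 3, 10))

# with pipe: read as "take scores, and then filter, and then summarise"
piped <- scores %>%
  filter(group == "a") %>%
  summarise(mean_value = mean(value))

# without pipe: the same calls, nested inside each other
nested <- summarise(filter(scores, group == "a"), mean_value = mean(value))

piped$mean_value
nested$mean_value
```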
Seven verbs for data wrangling (package dplyr)

• filter
• summarise
• group_by (and ungroup)
• mutate
• arrange
• rename
• select

filter()

• select rows from a table
• arguments are the filters that you want to apply
• use == to compare values
gap_2007 <- gapminder %>%
  filter(year == 2007)
head(gap_2007, n = 4)

## # A tibble: 4 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Albania Europe 2007 76.4 3600523 5937.
## 3 Algeria Africa 2007 72.3 33333216 6223.
## 4 Angola Africa 2007 42.7 12420476 4797.

filter(): more filters simultaneously (OR operator)

• use | (OR) to combine filters
• checks if at least one filter is satisfied
gapminder %>%
  filter(year == 2007 | continent == "Asia") %>%
  sample_n(8)

## # A tibble: 8 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Djibouti Africa 2007 54.8 496374 2082.
## 2 Bahrain Asia 1977 65.6 297410 19340.
## 3 Thailand Asia 1997 67.5 60216677 5853.
## 4 Namibia Africa 2007 52.9 2055080 4811.
## 5 Singapore Asia 1982 71.8 2651869 15169.
## 6 Korea, Dem. Rep. Asia 2007 67.3 23301725 1593.
## 7 Nepal Asia 1997 59.4 23001113 1011.
## 8 Reunion Africa 2007 76.4 798094 7670.

filter(): more filters simultaneously (AND operator)

• use , or & (AND) to combine filters
• checks whether all filters are satisfied
gapminder %>%
  filter(year == 2002, continent == "Asia")

## # A tibble: 33 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2002 42.1 25268405 727.
## 2 Bahrain Asia 2002 74.8 656397 23404.
## 3 Bangladesh Asia 2002 62.0 135656790 1136.
## 4 Cambodia Asia 2002 56.8 12926707 896.
## 5 China Asia 2002 72.0 1280400000 3119.
## 6 Hong Kong, China Asia 2002 81.5 6762476 30209.
## 7 India Asia 2002 62.9 1034172547 1747.
## 8 Indonesia Asia 2002 68.6 211060000 2874.
## 9 Iran Asia 2002 69.5 66907826 9241.
## 10 Iraq Asia 2002 57.0 24001816 4391.
## # ... with 23 more rows
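The difference between AND and OR is easy to see on a table of four rows. A minimal sketch, assuming dplyr is installed; the table `toy` is made up for illustration.

```r
library(dplyr)

# Made-up toy table, only for illustration
toy <- tibble(
  year      = c(2002, 2002, 2007, 2007),
  continent = c("Asia", "Europe", "Asia", "Europe")
)

and_comma <- toy %>% filter(year == 2002, continent == "Asia")   # AND
and_amp   <- toy %>% filter(year == 2002 & continent == "Asia")  # AND, same rows
either    <- toy %>% filter(year == 2002 | continent == "Asia")  # OR

nrow(and_comma)
nrow(and_amp)
nrow(either)
```

The comma and `&` keep only the single row that meets both conditions, while `|` keeps every row that meets at least one of them.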
filter(): use of %in%

• %in% replaces repeated use of | with ==
gapminder %>%
  filter(year %in% c(1987,1992,1997),
         country %in% c("Netherlands","Belgium"))

## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Belgium Europe 1987 75.4 9870200 22526.
## 2 Belgium Europe 1992 76.5 10045622 25576.
## 3 Belgium Europe 1997 77.5 10199787 27561.
## 4 Netherlands Europe 1987 76.8 14665278 23651.
## 5 Netherlands Europe 1992 77.4 15174244 26791.
## 6 Netherlands Europe 1997 78.0 15604464 30246.

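That `%in%` is shorthand for a chain of `==` comparisons joined by `|` can be verified in base R. A small sketch on a made-up vector of years:

```r
year <- c(1982L, 1987L, 1992L, 1997L, 2002L)

# long form: one comparison per value, joined by OR
with_or <- year == 1987 | year == 1992 | year == 1997

# short form: is each element one of the listed values?
with_in <- year %in% c(1987, 1992, 1997)

identical(with_or, with_in)
```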
summarise()

• function to calculate aggregate statistics


stats_2007 <- gapminder %>%
filter(year == 2007) %>%
summarise(max_exp = max(lifeExp),
mean_exp = mean(lifeExp),
sd_exp = sd(lifeExp))
stats_2007

## # A tibble: 1 x 3
## max_exp mean_exp sd_exp
## <dbl> <dbl> <dbl>
## 1 82.6 67.0 12.1

combination of group_by() and summarise()

• calculate a numerical summary per group using a categorical variable
stats_2007 <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(max_exp = max(lifeExp),
            mean_exp = mean(lifeExp),
            sd_exp = sd(lifeExp),
            aantal = n())   # aantal (Dutch for "count"): number of rows per group
stats_2007

## # A tibble: 5 x 5
## continent max_exp mean_exp sd_exp aantal
## <fct> <dbl> <dbl> <dbl> <int>
## 1 Africa 76.4 54.8 9.63 52
## 2 Americas 80.7 73.6 4.44 25
## 3 Asia 82.6 70.7 7.96 33
## 4 Europe 81.8 77.6 2.98 30
## 5 Oceania 81.2 80.7 0.729 2
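On a small table the per-group result can be checked by hand. A minimal sketch, assuming dplyr is installed; the table `toy` and the column names are made up for illustration.

```r
library(dplyr)

# Made-up toy table: group "a" has 2 rows, group "b" has 3 rows
toy <- tibble(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 10, 20, 30)
)

per_group <- toy %>%
  group_by(group) %>%
  summarise(mean_value = mean(value),  # one mean per group
            n_rows = n())              # number of rows per group

per_group
```

The result has one row per group: mean 2 with 2 rows for "a", and mean 20 with 3 rows for "b".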
mutate()

• create a new column with a fixed value


gap_plus <- gapminder %>%
mutate(just_one = 1)
head(gap_plus, n=4)

## # A tibble: 4 x 7
## country continent year lifeExp pop gdpPercap just_one
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 1
## 2 Afghanistan Asia 1957 30.3 9240934 821. 1
## 3 Afghanistan Asia 1962 32.0 10267083 853. 1
## 4 Afghanistan Asia 1967 34.0 11537966 836. 1

mutate()

• create a new column with other variables/columns


gap_gdp <- gapminder %>%
mutate(gdp = pop * gdpPercap)
head(gap_gdp, n=4)

## # A tibble: 4 x 7
## country continent year lifeExp pop gdpPercap gdp
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
## 2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
## 3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.

mutate()

• adapt existing columns
• this also works for a column created earlier in the same mutate() statement
gap_gdp <- gap_gdp %>%
  mutate(gdp = gdp/1000000)
head(gap_gdp, n=4)

## # A tibble: 4 x 7
## country continent year lifeExp pop gdpPercap gdp
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6567.
## 2 Afghanistan Asia 1957 30.3 9240934 821. 7585.
## 3 Afghanistan Asia 1962 32.0 10267083 853. 8759.
## 4 Afghanistan Asia 1967 34.0 11537966 836. 9648.

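A column created in a mutate() call can be used right away in the same call. A minimal sketch, assuming dplyr is installed; the table `toy` and the column `gdp_thousands` are made up for illustration.

```r
library(dplyr)

# Made-up toy table, only for illustration
toy <- tibble(pop = c(1000, 2000), gdpPercap = c(5, 10))

toy_gdp <- toy %>%
  mutate(gdp = pop * gdpPercap,       # new column from existing columns
         gdp_thousands = gdp / 1000)  # uses gdp, created one line earlier

toy_gdp
```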
combination of mutate() and group_by

• mutate columns by subcategory

gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  mutate(rank = rank(desc(lifeExp))) %>%
  arrange(rank)

## # A tibble: 142 x 7
## # Groups: continent [5]
## country continent year lifeExp pop gdpPercap rank
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Australia Oceania 2007 81.2 20434176 34435. 1
## 2 Canada Americas 2007 80.7 33390141 36319. 1
## 3 Iceland Europe 2007 81.8 301931 36181. 1
## 4 Japan Asia 2007 82.6 127467972 31656. 1
## 5 Reunion Africa 2007 76.4 798094 7670. 1
## 6 Costa Rica Americas 2007 78.8 4133884 9645. 2
## 7 Hong Kong, China Asia 2007 82.2 6980412 39725. 2
## 8 Libya Africa 2007 74.0 6036914 12057. 2
## 9 New Zealand Oceania 2007 80.2 4115771 25185. 2
## 10 Switzerland Europe 2007 81.7 7554661 37506. 2
## # ... with 132 more rows

arrange()

• Order rows on the basis of one or more columns
• Default is ascending or alphabetical order
• NAs will be placed at the end
gapminder %>% arrange(year,country) %>% head(5)

## # A tibble: 5 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Albania Europe 1952 55.2 1282697 1601.
## 3 Algeria Africa 1952 43.1 9279525 2449.
## 4 Angola Africa 1952 30.0 4232095 3521.
## 5 Argentina Americas 1952 62.5 17876956 5911.

arrange() : descending order

• use the function desc()


gapminder %>% filter(year == 2007) %>%
arrange(desc(lifeExp)) %>% head(n = 5)

## # A tibble: 5 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Japan Asia 2007 82.6 127467972 31656.
## 2 Hong Kong, China Asia 2007 82.2 6980412 39725.
## 3 Iceland Europe 2007 81.8 301931 36181.
## 4 Switzerland Europe 2007 81.7 7554661 37506.
## 5 Australia Oceania 2007 81.2 20434176 34435.

group_by and ungroup()

• Split the table into groups with group_by() and undo the grouping with ungroup()
temp <- gapminder %>%
  group_by(continent, country)
head(temp, n = 1)

## # A tibble: 1 x 6
## # Groups: continent, country [1]
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
temp <- temp %>% ungroup()
head(temp, n = 1)

## # A tibble: 1 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.

select(): select columns

• select columns from a table

## select continent and country
gapminder %>% select(continent, country)

## select all columns except gdpPercap
gapminder %>% select(-gdpPercap)

## select all columns from continent through pop
gapminder %>% select(continent:pop)

## select all columns except continent through pop
gapminder %>% select(-(continent:pop))

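The three selection styles can be compared on a table with four columns. A minimal sketch, assuming dplyr is installed; the table `toy` is made up for illustration.

```r
library(dplyr)

# Made-up toy table with columns a, b, c, d
toy <- tibble(a = 1, b = 2, c = 3, d = 4)

drop_one   <- toy %>% select(-d)        # everything except d
range_cols <- toy %>% select(b:c)       # b through c
drop_range <- toy %>% select(-(b:c))    # everything except b through c

names(drop_one)
names(range_cols)
names(drop_range)
```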
rename(): change column name

• change name of column
• left of the = sign: new name
• right of the = sign: old name
gapminder_NL <- gapminder %>%
rename(land = country,
jaar = year,
levensverwachting = lifeExp)

head(gapminder_NL, n = 2)

## # A tibble: 2 x 6
## land continent jaar levensverwachting pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.

Some useful functions for tables

nrow(gapminder)  # number of rows in the table
ncol(gapminder)  # number of columns in the table
names(gapminder) # column names of the table
str(gapminder)   # structure of the table
head(gapminder)  # head of the table (first rows)
tail(gapminder)  # tail of the table (last rows)

Difference with and without pipe

## select continent and country
temp <- gapminder %>% select(continent, country) ## with pipe
temp <- select(gapminder, continent, country) ## without pipe

## add column with gdp
gap_gdp <- gapminder %>%
  mutate(gdp = pop * gdpPercap) ## with pipe
gap_gdp <- mutate(gapminder, gdp = pop * gdpPercap) ## without pipe

## calculate maximum life expectancy
stats_2007 <- gapminder %>%
  filter(year == 2007) %>%
  summarise(max_exp = max(lifeExp)) ## with pipe

temp <- filter(gapminder, year == 2007) ## without pipe
stats_2007 <- summarise(temp, max_exp = max(lifeExp))

Import datasets

• package readr
• via the menu (Import Dataset button in the Environment tab)
library(readr)
lbw <- read_csv("data/lbw_data.csv")

Make categorical variables

• package forcats
library(forcats)
lbw <- lbw %>%
  mutate(low = factor(low),
         low = fct_recode(low, "No" = "0", "Yes" = "1"),
         race = factor(race),
         race = fct_recode(race, "White" = "1", "Black" = "2",
                           "Other" = "3"),
         smoke = factor(smoke),
         smoke = fct_recode(smoke, "No" = "0", "Yes" = "1"),
         ht = factor(ht),
         ht = fct_recode(ht, "No" = "0", "Yes" = "1"),
         ui = factor(ui),
         ui = fct_recode(ui, "No" = "0", "Yes" = "1"))

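The same kind of recoding from numeric codes to labels can be sketched in base R, without forcats. This is an alternative technique, not the one used on the slide: `factor()` takes the codes in `levels` and the matching names in `labels`. The vector `smoke_raw` is made up for illustration.

```r
# Made-up codes, only for illustration: 0 = no, 1 = yes
smoke_raw <- c(0, 1, 1, 0)

# levels lists the codes, labels gives the name for each code (same order)
smoke <- factor(smoke_raw, levels = c(0, 1), labels = c("No", "Yes"))

levels(smoke)
as.character(smoke)
```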
Exercise session 2
Make graphs with ggplot2
Package ggplot2

• Not standard R
• Based on Grammar of Graphics
• Graph = Data + Layout + Coordinate system
• Graph can have more layers
• A layer has aesthetic (aes) properties coupled with properties of the data
• Handy cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

Scatterplot example code

library(ggplot2)

ggplot(lbw, aes(x = age, y = bwt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  coord_cartesian(xlim = c(15,46), ylim = c(0, 5500)) +
  labs(title = "Birth weight and age mother",
       x = "Age mother",
       y = "Birth weight")

Scatterplot example
[Figure: scatterplot titled "Birth weight and age mother"; y-axis: Birth weight (ticks at 0, 2000, 4000)]

Scatterplot example code: split according to smoking

• aes(): col, shape, size
library(ggplot2)

ggplot(lbw, aes(x = age, y = bwt)) +
  geom_point(aes(col = smoke)) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Age mother",
       y = "Birth weight",
       color = "Smoking")

Scatterplot split according to smoking

[Figure: scatterplot of birth weight with points coloured by smoking; y-axis: Birth weight (ticks at 1000 to 5000); legend "Smoking": No, Yes]

Histogram

library(ggplot2)

ggplot(data = lbw) +
  geom_histogram(aes(x = age, y = after_stat(count), fill = race)) + # after_stat(count) replaces the older ..count..
  labs(x = "Age mother",
       y = "Number",
       fill = "Race")

Histogram

[Figure: histogram of the age of the mother, filled by race; y-axis: Number (ticks at 0, 10, 20); legend "Race": White, Black, Other]

Boxplot

library(ggplot2)

ggplot(data = lbw) +
  geom_boxplot(aes(x = race, y = age, fill = smoke)) +
  labs(x = "Race mother",
       y = "Age mother",
       fill = "Smoking")

Boxplot

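[Figure: boxplots of the age of the mother per race, filled by smoking; y-axis: Age mother (ticks at 20, 30, 40); legend "Smoking": No, Yes]
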
Subplots

library(ggplot2)

ggplot(data = lbw) +
  geom_histogram(aes(x = age, y = after_stat(count))) + # after_stat(count) replaces the older ..count..
  labs(x = "Age mother",
       y = "Number") +
  facet_wrap( ~ race, nrow = 1) # one panel per race, all in one row

Histogram split by race

[Figure: histograms of the age of the mother in three panels (White, Black, Other); x-axis: Age mother (ticks at 20, 30, 40); y-axis: Number]

Exercise session 3
Simple models
Example: relation between FEV and age
[Figure: scatterplot of fev (y-axis, ticks at 1 to 6) against age (x-axis, ticks at 5, 10, 15)]

Linear regression

• function lm()
• model formula: response ~ x1 + x2 + etcetera
• the intercept will be estimated automatically
• rows with missing values (NAs) will be discarded
library(readr)   # for read_csv()
library(modelr)

fevDat <- read_csv(file = "data/fev.csv")
model <- lm(fev ~ age, data = fevDat)

summary(model)      # summary of the model
coefficients(model) # estimated coefficients (betas)
nobs(model)         # number of observations without NAs

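How lm() treats the intercept and missing values can be checked on a tiny made-up data set. A minimal base R sketch; the data frame `toy` is constructed so that fev = 0.5 + 0.2 * age holds exactly for the four complete rows, and the fifth row has a missing age.

```r
# Made-up data, only for illustration; the last row has a missing age
toy <- data.frame(
  age = c(5, 10, 15, 20, NA),
  fev = c(1.5, 2.5, 3.5, 4.5, 9)
)

toy_model <- lm(fev ~ age, data = toy)

coefficients(toy_model)  # intercept is estimated although it is not in the formula
nobs(toy_model)          # the row with NA is dropped before fitting
```

The fit recovers intercept 0.5 and slope 0.2 from the four complete rows; the row with the missing age is not used.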
Output lm

##
## Call:
## lm(formula = fev ~ age, data = fevDat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.57539 -0.34567 -0.04989 0.32124 2.12786
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.431648 0.077895 5.541 4.36e-08 ***
## age 0.222041 0.007518 29.533 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5675 on 652 degrees of freedom
## Multiple R-squared: 0.5722, Adjusted R-squared: 0.5716
## F-statistic: 872.2 on 1 and 652 DF, p-value: < 2.2e-16

Predictions

fevDat <- add_predictions(fevDat, model)


temp2 <- tibble(age = c(25,35,45,55,65))
temp2 <- add_predictions(temp2, model)
temp2

## # A tibble: 5 x 2
## age pred
## <dbl> <dbl>
## 1 25 5.98
## 2 35 8.20
## 3 45 10.4
## 4 55 12.6
## 5 65 14.9

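The predictions above follow directly from the estimated coefficients: pred = intercept + slope * age. A short sketch that redoes the arithmetic by hand, with the two estimates copied from the lm output:

```r
# Coefficients copied from the lm output shown earlier
intercept <- 0.431648
slope     <- 0.222041

ages <- c(25, 35, 45, 55, 65)
pred_by_hand <- intercept + slope * ages

round(pred_by_hand, 2)
```

The rounded values agree with the pred column. Note that these ages lie well outside the age range shown in the scatterplot (about 5 to 15), so the predictions are extrapolations and should be read with caution.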
Residuals

• add_residuals() adds a column resid with the residual of each observation
fevDat <- add_residuals(fevDat, model)

## histogram of the residuals
ggplot(fevDat) + geom_histogram(aes(x = resid))
## residual plot (the column pred was added with add_predictions())
ggplot(fevDat) + geom_point(aes(x = pred, y = resid))

Logistic regression

model <- glm(low ~ race + smoke + ht + ui + age + lwt,
             data = lbw, family = "binomial")

summary(model)$coefficients # estimated coefficients (betas)

## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.43724022 1.191931228 0.3668334 0.713743272
## raceBlack 1.28064059 0.526694968 2.4314654 0.015037885
## raceOther 0.90188006 0.434362303 2.0763313 0.037863316
## smokeYes 1.02757057 0.393930669 2.6085061 0.009093838
## htYes 1.85761692 0.688848290 2.6966996 0.007003041
## uiYes 0.89538678 0.448493792 1.9964307 0.045887062
## age -0.01825600 0.035354134 -0.5163752 0.605592417
## lwt -0.01628503 0.006858566 -2.3744076 0.017577136

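The estimates of a binomial glm are on the log-odds scale, so exponentiating them gives odds ratios. This interpretation step is not on the slide; the sketch below only converts two estimates copied from the table above.

```r
# Estimates copied from the coefficient table above (log-odds scale)
log_odds <- c(smokeYes = 1.02757057, htYes = 1.85761692)

# exp() turns a log-odds difference into an odds ratio
odds_ratio <- exp(log_odds)

round(odds_ratio, 2)
```

An odds ratio above 1 means higher odds of a low birth weight for that group compared with the reference group, with the other variables held fixed.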
Exercise session 4
Information on the web

• www.stackoverflow.com
• www.r-project.org
• www.rweekly.org
• www.r-bloggers.com
• cheatsheets (see also Help > Cheatsheets)

End
