Using R For Data Analysis: Course December 2010
Methods are made available as packages in R
No need to wait until they are programmed into SPSS
R Course
This course teaches R proficiency
  The mechanics of R
  How to use other people's R scripts
  How to write your own R scripts
  How to use R packages
Focus: R as a language for data analysis
R Course 3 Saturday, 28 April 2012
Programme
Day 1: Essentials
  Reading/writing data
  Data manipulation
  Simple plots
Day 3: Advanced R
  Programming with R
  Packages
  Advanced plots
R: a short history
S: a programming language for statistics
by John Chambers (Bell Labs) in the 1970s-80s
Two implementations
  S-plus (1988, commercial)
  R (1993, GNU General Public License) by Robert Gentleman and Ross Ihaka (Univ. of Auckland)
Contributors:
  R core team (19 members)
  Huge community (users, package writers)
Open source
Open source: what does it mean?
Free software
Volunteer work (by academics)
Obtaining R
Repository for R and packages: www.r-project.org
Also on CRAN:
  Manuals (don't read them)
  Mailing lists + archives
R is just a calculator
> 2+4
[1] 6
> 18/3
[1] 6
> 2^10
[1] 1024
The order of calculations matters: use brackets!
> 12/2*3
[1] 18
> 12/(2*3)
[1] 2
Variables
> x <- 5
> x
[1] 5
> x+4
[1] 9
The <- assigns a value to a variable
Alternative: x = 5 works as well
Variable names
A variable can have any name you choose
  patients, Data, x2y2, sorted.data_file
No spaces
No @#!%$^()-+=!~`,<>}{][
. and _ are allowed
Numbers allowed but not as first character
Avoid variable names that have a meaning in R
Vectors
Use c to make vectors of numbers
> x <- c(3,2,7)
> x^2
[1]  9  4 49
Many functions in R operate on vectors
> sqrt(x)
[1] 1.732051 1.414214 2.645751
> sort(x)
[1] 2 3 7
> sum(x)
[1] 12
Vectors are central to R: in R everything is a vector
> x[1]
[1] 3
> x[c(1,3)]
[1] 3 7
> x[1:2]
[1] 3 2
Subsetting by names
Text data can be used to attach names to vectors
> deelnemers <- c(19, 8, 3)
> names(deelnemers) <- c("Leiden", "Amsterdam", "Rotterdam")
> deelnemers
   Leiden Amsterdam Rotterdam
       19         8         3
Extracting by names
Names can be useful for extracting from vectors
> deelnemers[c("Leiden", "Amsterdam")]
   Leiden Amsterdam
       19         8
Extracting from names
> names(deelnemers)
[1] "Leiden" "Amsterdam" "Rotterdam"
> names(deelnemers)[1]
[1] "Leiden"
> names(deelnemers[1])
[1] "Leiden"
Replacing elements
The operator [ ] can be used to replace elements
> x[2] <- 3
> x
[1] 3 3 7
> deelnemers["Amsterdam"] <- 7
> deelnemers
   Leiden Amsterdam Rotterdam
       19         7         3
> x[c(3,1,2)]
[1] 7 3 2
> x[3:2]
[1] 7 2
Similarly with names
> sort(x)
[1] 0 2 3 4 7
Order: gives the index vector that sorts a vector
> order(x)
[1] 5 2 1 4 3
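The sort and order output above is consistent with a five-element vector such as c(3, 2, 7, 4, 0). That definition is not shown in the slides, so it is an assumption here. The sketch below shows that indexing with order(x) gives the same result as sort(x), and that the same index can reorder a second vector (codes is a hypothetical companion vector).

```r
# Assumed input: the definition of x is not shown in the slides; this vector
# reproduces the sort() and order() output printed there.
x <- c(3, 2, 7, 4, 0)

sorted <- sort(x)      # the values in increasing order
idx <- order(x)        # the positions that put x in increasing order
by_index <- x[idx]     # indexing with order(x) is equivalent to sort(x)

# order() is mainly useful for sorting one vector by another.
# 'codes' is a hypothetical vector with one entry per element of x.
codes <- c("c", "b", "e", "d", "a")
codes_sorted <- codes[idx]
```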
Logicals
Data can also be logical (TRUE or FALSE)
> 3>4
[1] FALSE
Can be vectors too (everything can be a vector)
> c(3,2,7) < c(2,3,4)
[1] FALSE  TRUE FALSE
The ! operator negates a logical vector: TRUE becomes FALSE and vice versa
> !x
Beware of rounding errors!
> sqrt(2)*sqrt(2) == 2
[1] FALSE
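One common way around the rounding problem is to compare with a tolerance instead of using ==. The sketch below uses the base function all.equal, and also a manual check whose threshold of 1e-8 is an arbitrary assumption.

```r
a <- sqrt(2) * sqrt(2)

exact  <- a == 2                      # FALSE: a differs from 2 in the last digits
close  <- isTRUE(all.equal(a, 2))     # TRUE: equal up to a small tolerance
manual <- abs(a - 2) < 1e-8           # TRUE: same idea, threshold chosen by hand
```

all.equal returns a description of the difference rather than FALSE when the values differ, which is why it is wrapped in isTRUE.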
Summarizing logicals
> x <- c(T, F, T, F)
> any(x)
[1] TRUE
> all(x)
[1] FALSE
> sum(x)
[1] 2
The which function gives the indices of the TRUE entries
> which(x)
[1] 1 3
> which(stad == "Leiden")
[1] 1
> x[x<5]
[1] 3 2
ifelse
Using logical vectors to recode variables
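A minimal sketch of recoding with ifelse. The ages and the cut-off of 65 are made-up values for illustration. ifelse takes a logical vector and returns the second argument where it is TRUE and the third where it is FALSE.

```r
# Hypothetical ages; the cut-off of 65 is chosen for illustration only.
age <- c(70, 54, 81, 65)

# Recode the numeric variable into two categories.
group <- ifelse(age > 65, ">65", "<=65")

# Missing values stay missing: the test NA > 65 is itself NA.
age_na <- c(70, NA)
group_na <- ifelse(age_na > 65, ">65", "<=65")
```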
Missing values
Missing values are coded as NA (not available)
> x <- c(3,2,NA,7)
> is.na(x)
[1] FALSE FALSE TRUE FALSE
> any(is.na(x))
[1] TRUE
Round brackets ()
Applying a function sort(x)
> seq[6]
[1] 6
> seq(6)
[1] 1 2 3 4 5 6
Functions do not overwrite data
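A small sketch of what this means in practice, using sort as the example function: the function returns a new object and leaves its input untouched. To keep the result, assign it to a name.

```r
x <- c(3, 2, 7)

y <- sort(x)        # sort() returns a new, sorted vector
unchanged <- x      # x itself still holds 3 2 7

x <- sort(x)        # assigning the result back is what overwrites x
```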
Example data
R contains many example data sets
Type data() to see a list
Example data are immediately accessible in R
E.g. islands
Getting help
cor is the function for correlation
Function has four arguments, three of which have a default
The argument y has default NULL: the argument is optional
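A sketch of how defaults work when calling cor, using the built-in iris data (the choice of data set is an assumption for illustration). When y is left at its default NULL, x has to be a matrix or data.frame, and cor returns the correlations between its columns.

```r
# Both x and y supplied: a single correlation coefficient.
r_xy <- cor(iris$Petal.Width, iris$Petal.Length)

# y omitted (default NULL): correlations between all columns of x.
r_all <- cor(iris[, 1:4])

# An argument with a default can be overridden by name.
r_spearman <- cor(iris$Petal.Width, iris$Petal.Length, method = "spearman")
```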
> sum(x,y,z)
[1] 72
The value section describes the type of object the function returns
Most help files have examples that you can immediately run
Often based on data sets contained in R
Programmers' gibberish
Unfortunately, many of the help files were written by programmers, not users
Below, part of the help file for the round function
Data.frames
Rectangular datasets can be stored as single objects
  Rows: (typically) subjects
  Columns: (typically) variables
> ?infert
Data of a matched case control study
  Cases: women infertile after abortion
  Controls: women with intact fertility after abortion
Extracting in data.frames
Extracting rows and columns with [ ], but two arguments
Note the position of the comma
By row:
> infert[1:5,]
By column:
> infert[,c(4,5)]
By row and column
> infert[[1]]
Result is a vector
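The extraction forms above can be sketched together on the infert data. Selecting columns by name and with $ are additions that the slides do not show here; both are standard R.

```r
first_rows <- infert[1:5, ]                   # rows 1 to 5, all columns
two_cols   <- infert[, c(4, 5)]               # all rows, columns 4 and 5
block      <- infert[1:5, c(4, 5)]            # rows and columns at once
by_name    <- infert[1:5, c("age", "parity")] # columns can be chosen by name

col_vec <- infert[[1]]    # a single column as a vector (here a factor)
age_vec <- infert$age     # the same idea, selecting the column by name
```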
Making a data.frame
Use data.frame (works like c for vectors)
Combine vector with vector, data.frame with vector or data.frame with data.frame
Make sure the order of the observations is identical!
> mydata <- data.frame(group, measurement)
> head(mydata)
  group measurement
1     A   0.9233418
2     A   0.3545440
3     A   0.1448936
4     A   0.6325367
5     B   0.2974718
> ncol(infert)
[1] 8
> nrow(infert)
[1] 248
> dim(infert)
[1] 248   8
> head(infert)
  education age parity induced case spontaneous stratum pooled.stratum
1    0-5yrs  26      6       1    1           2       1              3
2    0-5yrs  42      1       1    1           0       2              1
3    0-5yrs  39      6       2    1           0       3              4
4    0-5yrs  34      4       2    1           0       4              2
5   6-11yrs  35      3       1    1           1       5             32
6   6-11yrs  36      4       2    1           1       6             36
> str(infert)
'data.frame': 248 obs. of 8 variables:
 $ education     : Factor w/ 3 levels "0-5yrs","6-11yrs",..: 1 1 1 1 2 2 2 2 2 2 ...
 $ age           : num 26 42 39 34 35 36 23 32 21 28 ...
 $ parity        : num 6 1 6 4 3 4 1 2 1 2 ...
 $ induced       : num 1 1 2 2 1 2 0 0 0 0 ...
 $ case          : num 1 1 1 1 1 1 1 1 1 1 ...
 $ spontaneous   : num 2 0 0 0 1 1 0 0 1 0 ...
 $ stratum       : int 1 2 3 4 5 6 7 8 9 10 ...
 $ pooled.stratum: num 3 1 4 2 32 36 6 22 5 19 ...
Attaching a data.frame
Use attach to access the variables in a data.frame easily
Make a data.frame your working data
> age
Error: object 'age' not found
> infert$age
 [1] 26 42 39 34 35 36 23 32 21 28 29 37 31 29 31
[16] ...
> attach(infert)
> age
 [1] 26 42 39 34 35 36 23 32 21 28 29 37 31 29 31
[16] ...
Undo with: detach(infert)
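An alternative that the slides do not cover, offered as a sketch: the base function with evaluates a single expression inside a data.frame. Nothing has to be detached afterwards, and there is no confusion if a variable with the same name also exists in the workspace.

```r
# Evaluate an expression using the columns of infert, without attaching it.
mean_age <- with(infert, mean(age))

# Equivalent, spelled out with $.
same <- mean(infert$age)
```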
> as.numeric(age)
[1] 2 1 2 1
> as.character(age)
[1] ">65"  "<=65" ">65"  "<=65"
Factor levels
Coding from number to text can be displayed and changed
> levels(age)
[1] "<=65" ">65"
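A sketch of displaying and changing factor levels. The factor is built here from made-up text data, and the new labels are hypothetical. The levels are given explicitly so that their order does not depend on the sorting rules of the local system.

```r
# Hypothetical data: an age category for four subjects.
age <- factor(c(">65", "<=65", ">65", "<=65"), levels = c("<=65", ">65"))

old_levels <- levels(age)   # the text labels, in coding order
codes <- as.numeric(age)    # the underlying numeric codes

# Assigning to levels() renames the categories; the codes stay the same.
levels(age) <- c("65 or younger", "over 65")
relabelled <- as.character(age)
```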
[1] <30 >40
Reading in data
Recommended use
> data <- read.table("myfile.txt", sep="\t", header = TRUE, quote="", row.names=1)
Arguments used:
  sep: the file is tab-delimited
  header: the first row is the column names
  quote: treat ' and " as ordinary text
  row.names: the first column consists of row names
Read.table frustrations
> data <- read.table("mydata.txt", sep="\t")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  line 2 did not have 8 elements
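Errors like this one often come from an apostrophe or a # inside a text field, because read.table treats ' and " as quotes and # as the start of a comment by default. The sketch below writes a small made-up file to a temporary location, counts the fields per line with count.fields, and reads the file with quoting and comments switched off. The file contents are assumptions for illustration, and other causes (such as rows that really do have too few fields) need a different fix.

```r
# A hypothetical tab-delimited file with apostrophes in a text column.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tscore",
             "1\tO'Brien\t3.5",
             "2\tSmith\t4.0",
             "3\tD'Arcy\t2.5"), tmp)

# Number of fields on each line when quotes and # are ordinary text.
fields <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")

# Read the file with the same settings.
dat <- read.table(tmp, sep = "\t", header = TRUE, quote = "", comment.char = "")

unlink(tmp)   # remove the temporary file
```

If count.fields reports a different number for some lines, those are the lines to inspect in the file itself.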
check.names: should R check that column names are valid variable names (e.g. no spaces)? Default: TRUE, which replaces illegal characters by dots.
e.g. time (survival) becomes time..survival.
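The renaming is done by the base function make.names, which can be called directly to preview what a column name will become. The second input below is a made-up example of a name that starts with a digit.

```r
# The example from the slide: the space and both brackets become dots.
fixed <- make.names("time (survival)")

# Hypothetical second example: names starting with a digit get an X in front.
fixed_more <- make.names(c("age", "2nd visit"))
```

To keep the original column names instead, read.table accepts check.names=FALSE. The columns then have to be accessed with [[ ]] or with backquotes.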
Recommended use
> write.table(mydata, file="myfile.txt", sep="\t", quote=FALSE, col.names=NA)
Arguments used:
  sep: the file is tab-delimited
  quote: do not put quotes around text data
  col.names: NA writes an empty header cell above the row names, so that the header lines up with the columns in Excel
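A round-trip sketch: data written with the settings above can be read back with the recommended read.table call. The data.frame and its row names are made up for illustration, and a temporary file stands in for myfile.txt.

```r
# Hypothetical data with row names.
mydata <- data.frame(score = c(3.5, 4.0), group = c("a", "b"),
                     row.names = c("p1", "p2"))

tmp <- tempfile(fileext = ".txt")
write.table(mydata, file = tmp, sep = "\t", quote = FALSE, col.names = NA)

# Read it back: the first column holds the row names.
back <- read.table(tmp, sep = "\t", header = TRUE, quote = "", row.names = 1)

unlink(tmp)   # remove the temporary file
```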
Excel problems
If possible, avoid Excel to store/edit data
Problem: no rigorous data structure in Excel
Potential hazards:
  Text in columns with number data
  Typos in nominal variable data: too many categories
  Data (or trash) outside the data rectangle
Recommended use
> library(foreign)
> read.spss("mydata.sav", to.data.frame=TRUE)
> load("mydata.RData")
or drag and drop the saved file into R
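A sketch of the full save and load cycle with the base functions save and load. A temporary file stands in for mydata.RData, and the object x is made up for illustration.

```r
x <- c(3, 2, 7)

tmp <- tempfile(fileext = ".RData")
save(x, file = tmp)    # store the object x, under its own name, in the file

rm(x)                  # x is now gone from the workspace

loaded <- load(tmp)    # restores x; the return value lists the restored names

unlink(tmp)            # remove the temporary file
```

Note that load restores objects under the names they were saved with, so it can silently overwrite an existing object of the same name.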
On quitting, R asks whether to save the workspace
If you say yes, the same objects are loaded automatically in the next R session
Recommended: say no: it is too easy to clog memory
Directories
Find the current working directory
> getwd()
Scripts
Collect the commands you use to analyze your data Store them in a script: a text file with R commands
Usual extension .R
Advantages:
  Reproducibility
  Easy future re-use of code
Tinn-R
A good editor has at least
  Syntax highlighting
  One click copy-paste to R
Script features
Source: Executing a whole script at once with or without echoing (writing the commands)
> source("myscript.R")
> source("myscript.R", echo=TRUE)
Comments in a script
Any line starting with # is not evaluated
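A self-contained sketch of sourcing a script. The script is a made-up three-line file written to a temporary location. It is run in a separate environment here to keep the workspace clean; a plain source("myscript.R") runs the commands in the workspace itself.

```r
# Write a small hypothetical script to a temporary file.
script <- tempfile(fileext = ".R")
writeLines(c("# this comment line is not evaluated",
             "values <- c(3, 2, 7)",
             "total <- sum(values)"), script)

# Run the whole script at once, in its own environment.
env <- new.env()
source(script, local = env)

# Objects created by the script can be picked up from that environment.
total <- get("total", envir = env)

unlink(script)   # remove the temporary file
```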
Simple plots
Scatterplots
Basic scatterplot of x versus y
> attach(iris)
> plot(Petal.Width, Petal.Length)
Example data: iris
Petal and sepal width and length of iris flowers
Coloring
The iris data consist of three species
Color the dots by species
> plot(Petal.Width, Petal.Length, col=Species)
Uses numeric values of factor
  1=black 2=red 3=green 4=blue 5=cyan 6=magenta
Plot symbols
Alternative: use plot symbols
> plot(Petal.Width, Petal.Length, pch=as.numeric(Species))
Note: no auto-conversion to numeric here
    xlab="width", ylab="height",
    col=Species)
> points(Sepal.Width, Sepal.Length, pch=2,
    col=Species)
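A complete, runnable sketch of adding points to an existing plot. The start of the plot call is not shown above, so the xlim and ylim settings here are assumptions: they widen the axes so that the sepal measurements fit in the same picture. The columns are accessed with iris$ so that the code also runs without attach(iris).

```r
# Axis ranges that cover both petals and sepals (an assumption, see above).
xr <- range(c(iris$Petal.Width, iris$Sepal.Width))
yr <- range(c(iris$Petal.Length, iris$Sepal.Length))

# Petals first: this call sets up the axes.
plot(iris$Petal.Width, iris$Petal.Length, xlim = xr, ylim = yr,
     xlab = "width", ylab = "height", col = iris$Species)

# Sepals are added to the same plot with a different symbol.
points(iris$Sepal.Width, iris$Sepal.Length, pch = 2, col = iris$Species)
```

Without the wider axis limits, points falling outside the range of the first plot would simply not be drawn.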
Line graphs
Use type="l" to connect successive points by lines
> plot(temperature, pressure, type="l")