
Using R For Data Analysis: Course December 2010

This document provides an overview and schedule for a 3-day course on using R for data analysis. Day 1 covers essential R skills like reading, writing, and manipulating data. Day 2 focuses on basic statistics in R, including descriptive statistics, tests, and models. Day 3 addresses advanced R programming, packages, and plots. The document also briefly outlines R's history as an open source programming language developed from S, and how to obtain R and its contributed packages from online repositories.

Uploaded by

Mitja Mitrovič
Copyright
© Attribution Non-Commercial (BY-NC)

Using R for Data Analysis

Course December 2010


Jelle Goeman Renee de Menezes

Why this course?


Modern data require modern statistical methods
  Genomics/bioinformatics
  Advanced survival analysis
  Causal modeling

Methods are made available as packages in R
  No need to wait until they are programmed into SPSS

R Course

Saturday, 28 April 2012

What does this course teach?


This is not a statistics course!
  To learn about statistics, follow Basic Methods and Reasoning in Biostatistics (and more advanced courses)

This course teaches R proficiency
  The mechanics of R
  How to use other people's R scripts
  How to write your own R scripts
  How to use R packages
Focus: R as a language for data analysis

Programme
Day 1: Essentials
  Reading/writing data
  Data manipulation
  Simple plots

Day 2: Basic statistics
  Descriptive statistics
  Tests
  Model objects

Day 3: Advanced R
  Programming with R
  Packages
  Advanced plots

R: a short history
S: a programming language for statistics
  by John Chambers (Bell Labs) in the 1970s-80s

Two implementations
  S-plus (1988, commercial)
  R (1993, GNU General Public License) by Robert Gentleman and Ross Ihaka (Univ. of Auckland)

Contributors:
  R core team (19 members)
  Huge community (users, package writers)

Open source
Open source: what does it mean?

Free software
  Volunteer work (by academics)

Anyone can see the source code

Anyone can contribute
  write code
  report bugs
  write documentation

Obtaining R
Repository for R and packages: www.r-project.org
  CRAN: Comprehensive R Archive Network

Free download
  New major version every year (e.g. now R 2.11.0)
  New minor version in between (e.g. R 2.11.1 upcoming)

Also on CRAN
  Manuals (don't read them)
  Mailing lists + archives

Fear of the prompt


Variables and vectors

R is just a calculator
> 2+4
[1] 6
> 18/3
[1] 6
> 2^10
[1] 1024

The order of calculations matters: use brackets!
> 12/2*3
[1] 18
> 12/(2*3)
[1] 2

Variables
> x <- 5
> x
[1] 5
> x+4
[1] 9

The <- assigns a value to a variable
  Alternative: x = 5 works as well
  More exotic: 5 -> x

Variables are stored in memory, not on disk

Variable names
A variable can have any name you choose
  patients, Data, x2y2, sorted.data_file
  No spaces
  No @#!%$^()-+=!~`,<>}{][
  . and _ are allowed
  Numbers are allowed, but not as first character

Avoid variable names that have a meaning in R
  sort, seq, data.frame

Some names R does not allow (reserved)
  for, if, while

Vectors
Use c to make vectors of numbers
> x <- c(3,2,7)
> x^2
[1]  9  4 49

Many functions in R operate on vectors
> sqrt(x)
[1] 1.732051 1.414214 2.645751
> sort(x)
[1] 2 3 7
> sum(x)
[1] 12

Vectors are central to R: in R everything is a vector
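Because functions and arithmetic operate on whole vectors, a calculation over many observations usually needs no loop. The sketch below illustrates this; the names height_cm, weight_kg and bmi and their values are made up for illustration and are not part of the course data.

```r
# Hypothetical example values, not from the course data
height_cm <- c(180, 165, 172)
weight_kg <- c(80, 60, 70)

# Arithmetic works element by element, so no loop is needed
bmi <- weight_kg / (height_cm / 100)^2

# Functions such as round() and mean() also take whole vectors
round(bmi, 1)
mean(bmi)
```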

Making vectors of regular sequences
> 1:5
[1] 1 2 3 4 5
> seq(2,10, by=2)
[1]  2  4  6  8 10
> seq(2,10, length=5)
[1]  2  4  6  8 10
> rep(1:2, times=2)
[1] 1 2 1 2
> rep(1:2, each=2)
[1] 1 1 2 2

Extracting from vectors

Extracting elements of vectors

Extracting by index
> x <- c(3,2,7)
> x[1]
[1] 3
> x[c(1,3)]
[1] 3 7
> x[1:2]
[1] 3 2

Extracting by negative index: remove items
> x[-1]
[1] 2 7

Character vectors and names


Data can be non-numeric, e.g. text
> stad <- c("Leiden", "Amsterdam", "Rotterdam")
> stad
[1] "Leiden"    "Amsterdam" "Rotterdam"
> toupper(stad)
[1] "LEIDEN"    "AMSTERDAM" "ROTTERDAM"

Premade text
> LETTERS[1:3]
[1] "A" "B" "C"
> letters[4:5]
[1] "d" "e"

Subsetting by names
Text data can be used to attach names to vectors
> deelnemers <- c(19, 8, 3)
> names(deelnemers) <- stad
> deelnemers
   Leiden Amsterdam Rotterdam
       19         8         3

Giving names in one go
> deelnemers <- c(Leiden=19, Amsterdam=8)
> deelnemers
   Leiden Amsterdam
       19         8

Extracting by names
Names can be useful for extracting from vectors
> deelnemers[c("Leiden", "Amsterdam")]
   Leiden Amsterdam
       19         8

Extracting from names
> names(deelnemers)
[1] "Leiden"    "Amsterdam" "Rotterdam"
> names(deelnemers)[1]
[1] "Leiden"
> names(deelnemers[1])
[1] "Leiden"

Replacing elements
The operator [ ] can be used to replace elements
> x <- c(3,2,7)
> x[2] <- 3
> x
[1] 3 3 7
> deelnemers["Amsterdam"] <- 7
> deelnemers
   Leiden Amsterdam Rotterdam
       19         7         3

Changing the order of a vector


Extracting can also be used to reorder a vector
> x <- c(3,2,7)
> x[c(3,1,2)]
[1] 7 3 2
> x[3:2]
[1] 7 2

Similarly with names

Note the rev function
> rev(x)
[1] 7 2 3

Sort, order and rank


Sort: sorts a vector (increasing by default)
> x <- c(3,2,7,4,0)
> sort(x)
[1] 0 2 3 4 7

Order: gives the index vector that sorts a vector
> order(x)
[1] 5 2 1 4 3

Rank: gives the rank of the data
> rank(x)
[1] 3 2 5 4 1
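The link between sort and order is that indexing a vector with its own order reproduces the sorted vector, and the same index can reorder a second, related vector. A minimal sketch; the labels in id are hypothetical.

```r
x <- c(3, 2, 7, 4, 0)
# Hypothetical labels belonging to the values in x
id <- c("a", "b", "c", "d", "e")

# Indexing with order(x) reproduces sort(x)
x[order(x)]

# The same index vector reorders a second vector consistently
id[order(x)]

# Decreasing order
x[order(x, decreasing = TRUE)]
```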

Logicals
Data can also be logical (TRUE or FALSE)
> 3>4
[1] FALSE

Can be vectors too (everything can be a vector)
> c(3,2,7) < c(2,3,4)
[1] FALSE  TRUE FALSE

Making logicals by comparing numbers
  Less/greater: <, >, <=, >=
  Exact equality: ==
  Not equal to: !=

Calculating with logicals


Calculating with logicals: & (= AND), | (= OR), ! (= NOT)
> x <- c(T, F, T, F)
> y <- c(T, T, F, F)
> x&y
[1]  TRUE FALSE FALSE FALSE
> x|y
[1]  TRUE  TRUE  TRUE FALSE
> !x
[1] FALSE  TRUE FALSE  TRUE

Beware of rounding errors!
> sqrt(2)*sqrt(2) == 2
[1] FALSE
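One common way around the rounding problem is to compare numbers up to a tolerance instead of exactly. The sketch below uses all.equal from base R; the tolerance 1e-8 in the manual check is an arbitrary choice for illustration.

```r
a <- sqrt(2) * sqrt(2)

# Exact comparison fails because of floating-point rounding
a == 2

# all.equal() compares up to a small tolerance; wrap it in isTRUE()
# because it returns a description of the difference when values differ
isTRUE(all.equal(a, 2))

# Equivalent manual check with an explicit (assumed) tolerance
abs(a - 2) < 1e-8
```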

Summarizing logicals
> x <- c(T, F, T, F)
> any(x)
[1] TRUE
> all(x)
[1] FALSE
> sum(x)
[1] 2

The which function gives the indices of the TRUE entries
> which(x)
[1] 1 3
> which(stad == "Leiden")
[1] 1

Extracting with logicals


Logicals can also be used to extract from vectors
> x <- c(3,2,7)
> x[c(TRUE, FALSE, TRUE)]
[1] 3 7
> x[x<5]
[1] 3 2
> deelnemers[stad == "Amsterdam"]
Amsterdam
        8

ifelse
Using logical vectors to recode variables
> p <- c(0.1, 0.01, 0.04)
> sign <- ifelse(p < 0.05, "sig", "ns")
> sign
[1] "ns"  "sig" "sig"

Pasting vectors together


The c function concatenates numbers or vectors
> c(1:3, 3:1)
[1] 1 2 3 3 2 1
> c(deelnemers, Utrecht=1)
   Leiden Amsterdam Rotterdam   Utrecht
       19         8         3         1

Finding the length of a vector
> x <- c(3,2,7)
> length(x)
[1] 3

Missing values
Missing values are coded as NA (not available)
> x <- c(3,2,NA,7)

Finding missing values
> is.na(x)
[1] FALSE FALSE  TRUE FALSE
> any(is.na(x))
[1] TRUE
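Once missing values are found, they usually have to be handled explicitly, because most summaries return NA when any value is missing. A short sketch using the same x as above; the name x_observed is only illustrative.

```r
x <- c(3, 2, NA, 7)

# Most summaries return NA when any value is missing
mean(x)

# na.rm = TRUE drops the missing values first
mean(x, na.rm = TRUE)

# Count the missing values, and keep only the observed ones
sum(is.na(x))
x_observed <- x[!is.na(x)]
x_observed
```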

Square and round brackets


Square brackets [ ]
  Extracting from data
  x[5]

Round brackets ( )
  Applying a function
  sort(x)

Why different brackets?


Function and data can share the same name
> seq <- 1:10
> seq[6]
[1] 6
> seq(6)
[1] 1 2 3 4 5 6

Functions do not overwrite data
Data do not overwrite functions
  Call your data data and still use R function data()

Example data
R contains many example data sets
  Type data() to see a list
  Example data are immediately accessible in R
  E.g. islands

Many more example data sets in packages!
  Example data from packages have to be loaded first
  Use data function (more about that later)

Getting help


Functions: getting help
> ?median
> help(median)

Functions: arguments and defaults

Median takes two arguments
  x: the object of which the median is to be computed
  na.rm: specifies whether missing values should be removed

na.rm has a default value FALSE

Addressing arguments by name or order

Typical mixed call
> median(z, na.rm=TRUE)

Other calls (completely by name or by order)
> median(z, TRUE)
> median(na.rm=TRUE, x=z)
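The three calling styles give the same result, which can be checked directly. In the sketch below, the vector z and the names m_mixed, m_position and m_named are made up for illustration, since z is not defined on the slides.

```r
# Hypothetical data; z is not defined in the slides
z <- c(5, 1, NA, 3)

m_mixed    <- median(z, na.rm = TRUE)      # x by position, na.rm by name
m_position <- median(z, TRUE)              # both by position
m_named    <- median(na.rm = TRUE, x = z)  # both by name, in any order

c(m_mixed, m_position, m_named)
```

Naming the optional arguments, as in the mixed call, tends to be the most readable style because the reader does not need to remember the argument order.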

Other ways of specifying defaults

cor is the function for correlation
Function has four arguments, three of which have a default

The argument y has default NULL
  The argument is optional

The argument method has default "pearson"
  Other options are listed

The ... argument

Some functions have a ... argument
  Supply as many arguments as you like

> sum(x,y,z)
[1] 72
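As a small illustration of an argument list of arbitrary length, sum accepts any number of vectors, and its other argument (na.rm) must then be given by name. The values of x, y and z here are hypothetical and differ from those behind the result 72 above.

```r
# Hypothetical values, chosen only for illustration
x <- 1:3
y <- 10
z <- c(4, 5)

# All three vectors are added up together
sum(x, y, z)

# Arguments that come after ... must be given by name
sum(x, NA, z, na.rm = TRUE)
```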

Objects returned by a function

The value section describes the type of object the function returns


Examples in help files

Most help files have examples that you can immediately run
Often based on data sets contained in R

> sapply(USArrests, mean, trim = 0.2)
  Murder  Assault UrbanPop     Rape
    7.42   167.60    66.20    20.16

In R 3.0.0 and later, mean() does not accept a data.frame; older versions allowed mean(USArrests, trim = 0.2).

Programmers gibberish
Unfortunately, many of the help files were written by programmers, not users Below, part of the help file for the round function


Getting help if you don't know the function name

Getting help not on a specific function is more troublesome

Option 1: search the help files
> ??mean
> help.search("mean")

Option 2: search the web
  R mailing list archives: http://www.r-project.org/mail.html
  Google

Option 3: ask your local expert

Data.frames and factors


Data.frames
Rectangular datasets can be stored as single objects
  Rows: (typically) subjects
  Columns: (typically) variables

Example: infert data.frame


Example data in R
Description (very limited):
> ?infert

Data of a matched case control study
  Cases: women infertile after abortion
  Controls: women with intact fertility after abortion

Several variables measured
  Education, parity, etc.

Extracting in data.frames
Extracting rows and columns with [ ], but two arguments
Note the position of the comma

By row:
> infert[1:5,]
By column:
> infert[,c(4,5)]
By row and column:
> infert[2:4, "parity"]

Result is a smaller data.frame (or a vector, when a single column is selected)

Accessing individual variables in data.frame


Alternative list-type extraction in data.frames
  Selecting component vectors of the data.frame
  We come back to lists in the third day of the course

Extraction using $
> infert$parity
Extraction using [[ ]]
> infert[["parity"]]
> infert[[1]]

Result is a vector
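The difference between the two extraction styles shows up in the type of the result. The following sketch checks this on infert; the names sub_df and one_col are only illustrative.

```r
# infert ships with R (package datasets)
sub_df  <- infert[, c("age", "parity")]  # two columns: a data.frame
one_col <- infert[["parity"]]            # a single column: a vector

class(sub_df)
class(one_col)

# $ and [[ ]] give the same vector
identical(infert$parity, one_col)

# A single column selected with [ , ] is simplified to a vector
# unless drop = FALSE is given
class(infert[2:4, "parity"])
class(infert[2:4, "parity", drop = FALSE])
```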

Making a data.frame
Use data.frame (works like c for vectors)
  Combine vector with vector, data.frame with vector or data.frame with data.frame
  Make sure the order of the observations is identical!
> mydata <- data.frame(group, measurement)
> head(mydata)
  group measurement
1     A   0.9233418
2     A   0.3545440
3     A   0.1448936
4     A   0.6325367
5     B   0.2974718

Making a data.frame (cont)


With new variable names
> mydata <- data.frame(mydata, measurement=x)
Alternative
> mydata <- cbind(mydata, measurement=x)
Adding a new observation (new row)
> mydata <- rbind(mydata, new.patient)

row names and column names


A data.frame has names for rows and columns
  Not optional: obligatory

Access names or change them
> colnames(infert)
> rownames(infert)
> names(infert)    # identical to colnames
> rownames(infert) <- my.row.names

Default row names
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"
 [9] "9"  "10" "11" "12" "13" "14" "15" "16"

How large is the data.frame?


Finding the dimensions of a data.frame
> ncol(infert)
[1] 8
> nrow(infert)
[1] 248
> dim(infert)
[1] 248   8
> length(infert)
[1] 8

Getting an overview of a data.frame


head: prints the first six rows
> head(infert)
  education age parity induced case spontaneous stratum pooled.stratum
1    0-5yrs  26      6       1    1           2       1              3
2    0-5yrs  42      1       1    1           0       2              1
3    0-5yrs  39      6       2    1           0       3              4
4    0-5yrs  34      4       2    1           0       4              2
5   6-11yrs  35      3       1    1           1       5             32
6   6-11yrs  36      4       2    1           1       6             36

str: summarizes the structure of the data.frame
> str(infert)
'data.frame': 248 obs. of 8 variables:
 $ education     : Factor w/ 3 levels "0-5yrs","6-11yrs",..: 1 1 1 1 2 2 2 2 2 2 ...
 $ age           : num 26 42 39 34 35 36 23 32 21 28 ...
 $ parity        : num 6 1 6 4 3 4 1 2 1 2 ...
 $ induced       : num 1 1 2 2 1 2 0 0 0 0 ...
 $ case          : num 1 1 1 1 1 1 1 1 1 1 ...
 $ spontaneous   : num 2 0 0 0 1 1 0 0 1 0 ...
 $ stratum       : int 1 2 3 4 5 6 7 8 9 10 ...
 $ pooled.stratum: num 3 1 4 2 32 36 6 22 5 19 ...

Attaching a data.frame
Use attach to access the variables in a data.frame easily
Make a data.frame your working data
> age
Error: object 'age' not found
> infert$age
 [1] 26 42 39 34 35 36 23 32 21 28 29 37 31 29 31
[16] ...
> attach(infert)
> age
 [1] 26 42 39 34 35 36 23 32 21 28 29 37 31 29 31
[16] ...

Undo with: detach(infert)

Categorical data: factors


Factor: object type for categorical data
  Displayed as text
  Internally coded as numbers
> age <- factor(c(">65", "<=65", ">65", "<=65"))
> age
[1] >65  <=65 >65  <=65
Levels: <=65 >65
> as.numeric(age)
[1] 2 1 2 1
> as.character(age)
[1] ">65"  "<=65" ">65"  "<=65"

Factor levels
Coding from number to text can be displayed and changed
> levels(age)
[1] "<=65" ">65"
> levels(age) <- c("young", "old")
> age
[1] old   young old   young
Levels: young old
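The internal number coding matters most when the levels themselves look like numbers: as.numeric then returns the codes, not the values. A minimal sketch with a hypothetical factor called dose:

```r
# Hypothetical factor whose labels look like numbers
dose <- factor(c("10", "50", "10", "100"))

levels(dose)      # sorted as text: "10" "100" "50"
as.numeric(dose)  # internal codes, not the doses

# Convert via character to recover the numbers
as.numeric(as.character(dose))
```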

Categorizing numerical data


Example: cutting age into age categories
> cut(age, c(20,30,40,50))
 [1] (20,30] (40,50] (30,40] (30,40] (30,40]
 [6] (30,40] (20,30] (30,40] (20,30] (20,30]
Levels: (20,30] (30,40] (40,50]
> cut(age, c(20,30,40,50), labels=c("<30", "30-40", ">40"))
 [1] <30   >40   30-40 30-40 30-40
 [6] 30-40 <30   30-40 <30   <30
Levels: <30 30-40 >40
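The interval notation (20,30] means that 30 is included but 20 is not, so a value exactly on the lowest break becomes NA by default. The sketch below shows the two arguments of cut that change this; the vector ages and the result names are hypothetical.

```r
# Hypothetical ages sitting exactly on or next to the breaks
ages <- c(20, 30, 31)

# Default: intervals are open on the left, so 20 falls outside and becomes NA
default_cut <- cut(ages, c(20, 30, 40))
default_cut

# include.lowest = TRUE closes the first interval on the left: [20,30]
lowest_cut <- cut(ages, c(20, 30, 40), include.lowest = TRUE)
lowest_cut

# right = FALSE switches to intervals closed on the left: [20,30), [30,40)
left_cut <- cut(ages, c(20, 30, 40), right = FALSE)
left_cut
```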

Reading in data


Reading data from text files


The read.table function

Recommended use
> data <- read.table("myfile.txt", sep="\t", header = TRUE, quote="", row.names=1)

Arguments used:
  sep: the file is tab-delimited
  header: the first row is the column names
  quote: treat ' and " as ordinary text
  row.names: the first column consists of row names

Read.table frustrations
> data <- read.table("mydata.txt", sep="\t")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  line 2 did not have 8 elements

Cause: the number of tabs is not the same in every line

Diagnose:
  set fill=TRUE in read.table (adds missing tabs)
  Look at the read file to see what is wrong

Other useful arguments


dec: the decimal point (. or ,)
as.is: should text be read as factor (FALSE; the default before R 4.0.0) or character (TRUE)
skip: skip the first few lines of the file
strip.white: remove spaces at the start or end of text
check.names: should R check that column names are valid variable names (they are not if they include e.g. a space)? Default: TRUE, which replaces illegal characters by dots
  e.g. "time (survival)" becomes "time..survival."
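The effect of check.names can be tried without a file by passing the file contents through the text argument of read.table. A small sketch; the two-column content in txt is invented for illustration.

```r
# Hypothetical two-column file supplied inline through the text argument
txt <- "time (survival)\tstatus\n5\t1\n8\t0"

# Default check.names = TRUE: illegal characters are replaced by dots
checked <- read.table(text = txt, sep = "\t", header = TRUE)
names(checked)

# check.names = FALSE keeps the column names exactly as in the file
unchecked <- read.table(text = txt, sep = "\t", header = TRUE,
                        check.names = FALSE)
names(unchecked)

# Columns with such names are then reached with [[ ]] (or backticks)
unchecked[["time (survival)"]]
```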

Writing text files


Writing to tab-delimited text

Recommended use
> write.table(mydata, file="myfile.txt", sep="\t", quote=FALSE, col.names=NA)

Arguments used:
  sep: the file is tab-delimited
  quote: do not put quotes around text data
  col.names: NA writes an empty header cell above the row names (Excel-style)
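The recommended write.table and read.table calls are each other's mirror image, which a round trip through a temporary file can illustrate. In this sketch the data set and the names myfile and readback are hypothetical.

```r
# Hypothetical small data set with row names
mydata <- data.frame(group = c("A", "B"), measurement = c(0.5, 1.5),
                     row.names = c("patient1", "patient2"))

# A temporary file keeps the example from touching your own files
myfile <- tempfile(fileext = ".txt")

# Write with the recommended settings
write.table(mydata, file = myfile, sep = "\t", quote = FALSE, col.names = NA)

# Read back with the matching settings
readback <- read.table(myfile, sep = "\t", header = TRUE, quote = "",
                       row.names = 1)
readback
```

With col.names=NA the header line has as many fields as the data lines, so the file also opens correctly aligned in a spreadsheet program.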

Reading from the clipboard


To copy/paste from Excel
  Make a block in Excel and select copy (or Ctrl-C)
  Read into R with
> mydata <- read.table("clipboard")

Similarly
> write.table(mydata, "clipboard")
  And paste into Excel

Excel problems

If possible, avoid Excel to store/edit data
Problem: no rigorous data structure in Excel
Potential hazards:
  Text in columns with number data
  Typos in nominal variable data: too many categories
  Data (or trash) outside the data rectangle
  Regional settings problems: decimal separator
  Invisible spaces

Reading data from SPSS


Facilities for reading SPSS data in foreign package

Recommended use
> library(foreign)
> read.spss("mydata.sav", to.data.frame=TRUE)

Note: ignore warnings

Similar methods for reading SAS or STATA files
See also read.xls

Save and load


Saving objects for future use within R
> save(x, mydata, file="mydata.RData")
  Use extension .RData or .rda

Loading the data in a future R session
> load("mydata.RData")
  or drag and drop the saved file into R

Find out the names of the variables you've loaded
> ls()
[1] "mydata" "x"
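A save and load cycle can be tried safely with a temporary file. The sketch below assumes small made-up objects; rdata_file and loaded are illustrative names. It also uses the fact that load returns the names of the objects it restored, which is an alternative to ls().

```r
# Hypothetical objects to store
x <- c(3, 2, 7)
mydata <- data.frame(group = c("A", "B"), measurement = c(0.5, 1.5))

rdata_file <- tempfile(fileext = ".RData")
save(x, mydata, file = rdata_file)

# Remove the objects, then restore them from the file
rm(x, mydata)
loaded <- load(rdata_file)  # load() returns the names it restored
loaded
x
```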

The workspace: quitting R


Upon closing, R asks "Save workspace image?"
  Workspace: the collection of all objects in memory

If you say yes, the same objects are loaded automatically in the next R session
  Recommended: say no, because it is too easy to clog memory

Choose what to save and load: use save and load

Quit R from the prompt:
> q(save="no")

Directories
Find the current working directory
> getwd()
[1] "m:/onderwijs/aio r cursus"

Change the working directory
> setwd("m:/onderwijs/aio R cursus")

Path specification: use / (single) or \\ (double)
> load("m:\\onderwijs\\R exercises.RData")

Find what files are in the working directory
> dir()

Scripts


.R
Collect the commands you use to analyze your data
Store them in a script: a text file with R commands
  Usual extension .R

Advantages:
  Reproducibility
  Easy future re-use of code

Use a special editor for R
  List of editors: http://www.sciviews.org/_rgui/projects/Editors.html

Tinn-R
A good editor has at least
  Syntax highlighting
  One click copy-paste to R

Recommended free editor (Windows): Tinn-R

Script features
Source: Executing a whole script at once
  with or without echoing (writing the commands)
> source("myscript.R")
> source("myscript.R", echo=TRUE)

Comments in a script
  Any line starting with # is not evaluated

Saving all the commands you typed at the prompt
> savehistory("myscript.R")
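To see source at work without creating a file by hand, a tiny script can be written to a temporary file first. This is only a sketch: the script contents and the names script, y and total are made up.

```r
# Write a hypothetical three-line script to a temporary file
script <- tempfile(fileext = ".R")
writeLines(c("# lines starting with # are comments",
             "y <- 1:10",
             "total <- sum(y)"), script)

# Run the whole script silently; its variables are now available
source(script)
total

# Run it again, this time printing each command as it is executed
source(script, echo = TRUE)
```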

Simple plots

Scatterplots

Basic scatterplot of x versus y
> attach(iris)
> plot(Petal.Width, Petal.Length)

Example data: iris
  Petal and sepal width and length of iris flowers

Tuning the plot: titles, labels


Default labels: variable names (often ugly)
Change labels:
  xlab: x-axis label
  ylab: y-axis label
  main: plot title
> plot(Petal.Width, Petal.Length, xlab="Petal Width", ylab="Petal Length", main="Iris flowers")

Coloring
The iris data consist of three species
Color the dots by species
> plot(Petal.Width, Petal.Length, col=Species)

Uses numeric values of factor
  1=black 2=red 3=green 4=blue 5=cyan 6=magenta

Plot symbols
Alternative: use plot symbols
> plot(Petal.Width, Petal.Length, pch=as.numeric(Species))

Note: no auto-conversion to numeric here

All plot symbols


Log scale plots


Skewed data are sometimes better plotted on a log scale
Use
  log="x": x-axis log-transformed
  log="y": y-axis log-transformed
  log="xy": both axes log-transformed

Axis labels remain in original units
> plot(Petal.Width, Petal.Length, log="xy")

Adding points to an existing plot


points adds new points to an existing plot
> plot(Petal.Width, Petal.Length)
> points(Sepal.Width, Sepal.Length, pch=2)

Plot window is determined by call to plot
  Points outside plot window are not drawn

Prepare the window


Solution: prepare the plot window
> plot(Petal.Width, Petal.Length,
    xlim=range(Petal.Width, Sepal.Width),
    ylim=range(Petal.Length, Sepal.Length),
    xlab="width", ylab="height",
    col=Species)
> points(Sepal.Width, Sepal.Length, pch=2,
    col=Species)
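The same idea can be written as a self-contained sketch that does not rely on attach. It uses with() to look up the iris variables, draws to a temporary PDF file so that it also runs without a screen, and adds a legend. The file name plot_file, the "length" axis label and the legend are additions for illustration, not part of the slides.

```r
# Draw to a temporary PDF file so the sketch also runs without a screen
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

with(iris, {
  # range() over both variables makes the window large enough for all points
  plot(Petal.Width, Petal.Length,
       xlim = range(Petal.Width, Sepal.Width),
       ylim = range(Petal.Length, Sepal.Length),
       xlab = "width", ylab = "length", col = Species)
  points(Sepal.Width, Sepal.Length, pch = 2, col = Species)
  # Colours 1:3 correspond to the numeric codes of the Species factor
  legend("topleft", legend = levels(Species), col = 1:3, pch = 1)
})

dev.off()
```

Leaving out the pdf() and dev.off() lines draws the plot in the usual graphics window instead.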

Multi-color multi-symbol plot


Line graphs
Use type="l" to connect successive points by lines
> plot(temperature, pressure, type="l")
