L1 Intro R
October 2024
1. Starting with R
R is a scientific software package developed by the R Foundation for Statistical Computing and distributed
from their R website.
The name is partly based on the (first) names of the first two R authors (Robert Gentleman and Ross Ihaka),
and partly a play on the name of the Bell Labs language S. S is a very high-level language and an environment
for data analysis and graphics. The earliest beginnings of S came from discussions in the spring of 1976
among a group of five people at Bell Labs: Rick Becker, John Chambers, Doug Dunn, Paul Tukey, and
Graham Wilkinson. We can regard S as a language with three current implementations or “engines”: the
“old S engine” (S version 3; S-Plus 3.x and 4.x), the “new S engine” (S version 4; S-Plus 5.x and above), and
R. Given this understanding, asking for “the differences between R and S” really amounts to asking for the
specifics of the R implementation of the S language, i.e., the difference between the R and S engines. R is
just one particular implementation of the S language.
Although the R language differs somewhat from MATLAB, its functionality is similar, and it offers powerful
packages for pattern recognition and multivariate analysis. It has a strong community with very active forums,
and the language is under continuous development. The language has been implemented in a
very portable fashion and in a way that requires relatively modest machine resources. Ports exist for
many members of the Unix family of operating systems, including AIX, FreeBSD, GNU/Linux, HP-UX, Irix,
Mac OS X, Solaris and Tru64. In addition, there is a version for Microsoft Windows (version 98 onwards).
There are a number of user interfaces available for R, such as a mode for Emacs (ESS: Emacs Speaks Statistics), Rcmdr
and RKWard. R distributions for OS X and Microsoft Windows are packaged with custom GUIs.
1.2 Starting R
Under Windows, just follow: Start -> All Programs -> R -> R. On Mac OS X and Microsoft Windows,
R comes with a native GUI; on GNU/Linux and other UNIX implementations, a GUI can be started later on
from the command line interface.
1 You can install any package in R with the command install.packages("package name"). To build a base installation
of R on an Ubuntu Linux computer, you can use sudo apt-get install r-recommended. You can download RStudio from
http://www.rstudio.com/products/rstudio/download/
Important!! R is case sensitive!!!
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
##
## ls> .Ob <- 1
##
## ls> ls(pattern = "O")
## character(0)
##
## ls> ls(pattern= "O", all.names = TRUE) # also shows ".[foo]"
## [1] ".Ob"
##
## ls> # shows an empty list because inside myfunc no variables are defined
## ls> myfunc <- function() {ls()}
##
## ls> myfunc()
## character(0)
##
## ls> # define a local variable inside myfunc
## ls> myfunc <- function() {y <- 1; ls()}
##
## ls> myfunc() # shows "y"
## [1] "y"
The help system can be searched for a given pattern through the help.search() command.
## [1] 3.141593
b
## [1] 1.414214
After the assignments, list the objects that live in memory with the objects() function:
objects()
Then you can remove objects from your session with the command rm():
rm(myfunc,a)
objects()
## [1] "b"
Also, R can be called for batch processing from the shell or command line as follows (see footnote 2):
$ R [options] [< infile] [> outfile]
where the commands to process are found in infile and the output is logged to outfile.
2. Simple types
2.1 Vectors
As in many other languages, the simplest structure is a numeric vector, defined with the help of the concatenate
function c(),
x <- c( 2.3, 3, 5, 7, 7.5, 7, 5, 4)
y <- c(x,3,x)
Simple arithmetic works as in other languages, with +, -, *, /, ^ as well as the typical functions log, exp, sin, cos, tan, sqrt.
A vector with regular spacing can be generated with the seq() function:
sq <-seq( 5, -2, by = -0.5)
Note that arguments in R are matched mainly by name and, in the absence of a name, by position, so the following
code produces the same result,
sq <- seq( by = -0.5, to = -2,
from = 5)
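As a small sketch of the points above, the following applies element-wise arithmetic to the vector x and checks that the positional and the named call to seq() give the same sequence. The names x2, lx, sq_pos and sq_name are introduced here only for illustration.

```r
# Element-wise arithmetic on a numeric vector
x  <- c(2.3, 3, 5, 7, 7.5, 7, 5, 4)
x2 <- x^2      # square of every element
lx <- log(x)   # natural logarithm of every element

# The same sequence, once with positional and once with named arguments
sq_pos  <- seq(5, -2, by = -0.5)
sq_name <- seq(by = -0.5, to = -2, from = 5)

identical(sq_pos, sq_name)   # TRUE
length(sq_pos)               # 15 values, from 5 down to -2
```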
2.4 Indexing
Any object can be indexed, whether a vector, an array or a data frame, as you will see later on. The elements of any
vector are retrieved through the [ operator, as follows,
x[1]
## [1] 2.3
2 This example is for the UNIX case.
x[3]
## [1] 5
x[c(3,5)]
## [1] 5.0 7.5
Note that R also supports more advanced indexing by name, which helps with metadata management, for example:
geneexp <- c(1.3, -4, 3, 4, 0.3)
geneexp
## F7 F11
## 3 4
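The step that attaches the names is not shown above. A minimal sketch, assuming the labels are set with names(): only F7 and F11 appear in the output, so the other three labels here are hypothetical.

```r
geneexp <- c(1.3, -4, 3, 4, 0.3)

# Hypothetical labels: only "F7" and "F11" come from the output above
names(geneexp) <- c("F2", "F5", "F7", "F11", "F13")

# Select elements by name instead of by position
geneexp[c("F7", "F11")]
```

Indexing by name keeps working if the order of the elements changes, which is the practical advantage over positional indexing.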
2.5. Factors
Factors are a special and useful type, as they are the natural form for representing categorical data. Any
vector can be converted to a factor with the as.factor() function, or created with the factor() function,
cvf <- as.factor(cv)
print(cvf)
## [1] X1 Y2 X1 Y1 X2 Y3 X2 Y3
## Levels: X1 X2 Y1 Y2 Y3
summary(cvf)
## X1 X2 Y1 Y2 Y3
## 2 2 1 1 2
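The vector cv is not defined in the text above. Assuming it is a character vector holding the values shown in the printed factor, the conversion could be reproduced along these lines:

```r
# Assumed definition of cv, reconstructed from the printed output
cv  <- c("X1", "Y2", "X1", "Y1", "X2", "Y3", "X2", "Y3")
cvf <- as.factor(cv)

levels(cvf)       # distinct categories, sorted alphabetically
table(cvf)        # number of observations per level
as.integer(cvf)   # internal integer codes, one per element
```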
Factors can be very useful from a practical perspective. For example, assume you have measured the
FVII concentration in blood of individuals carrying different diseases:
disease <- c("Hemophilia A", "Thrombocytopenia", "Ehlers-Danlos syndrome",
"Hemophilia", "Ehlers-Danlos syndrome", "Thrombocytopenia", "Vasculitis",
"Vasculitis", "Vasculitis", "Thrombocytopenia", "Myeloproliferative",
"Thrombocytopenia", "Myeloproliferative", "Hemophilia", "Thrombocytopenia",
"Ehlers-Danlos syndrome",
"Thrombocytopenia", "Thrombocytopenia", "Thrombocytopenia", "Vasculitis",
"Hemophilia A", "Ehlers-Danlos syndrome", "Thrombocytopenia", "Myeloproliferative",
"Ehlers-Danlos syndrome", "Hemophilia B", "Hhemophilia A", "Vasculitis",
"Ehlers-Danlos syndrome", "Vasculitis")
diseasef <- factor(disease)
diseasef
## [,1] [,2]
## [1,] 0.05753091 0.01387335
## [2,] -0.86792728 0.32248995
## [3,] -0.53033853 -0.37494762
## [4,] 1.72361002 -0.95641824
## [5,] -2.18436755 1.17342998
Zero-filled matrices can be created with the array() function (e.g. array(0, c(5, 2))). As with vectors,
elements in arrays can be accessed through the [] operator:
z[3,2]
## [1] -0.3749476
z[,2]
## [,1] [,2]
## [1,] 8.780160 -4.291945
## [2,] -4.291945 2.536452
Other functions of interest for matrix-matrix, matrix-vector and matrix operations are diag(), crossprod(), solve(),
eigen(), svd(), etc.
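A brief sketch of some of these functions on a small, fixed matrix; the matrix A and the vector b are illustrative values, not taken from the text.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # symmetric 2 x 2 matrix
b <- c(1, 2)

AtA  <- crossprod(A)       # same as t(A) %*% A
xsol <- solve(A, b)        # solves the linear system A %*% x = b
dA   <- diag(A)            # main diagonal of A
ev   <- eigen(A)$values    # eigenvalues of A

A %*% xsol                 # recovers b, up to rounding
```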
## [1] 1e+05
ty <- "type"
l[[ty]]
## List of 3
## $ name : chr "FVII"
## $ type : chr "Coagulation Factor"
## $ range: num [1:2] 1e+02 1e+05
Lists can be concatenated with the c() function
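The definition of the list l is not shown above. A minimal sketch, assuming it was built with list() using the element names and values visible in the printed structure; the extra element units is hypothetical and only illustrates concatenation.

```r
# Assumed construction, based on the printed structure of l
l <- list(name = "FVII", type = "Coagulation Factor", range = c(100, 1e5))

l$name             # access an element by name with $
l[["range"]][2]    # second value of the 'range' element
ty <- "type"
l[[ty]]            # the element name can be stored in a variable

# Concatenating two lists with c(); 'units' is a hypothetical element
l2 <- c(l, list(units = "arbitrary"))
names(l2)
```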
A Data Frame is a special list of class “data.frame”, which can be mainly regarded as a matrix with columns
possibly of different modes (or types) and attributes. It can be displayed in matrix form, and its rows and
columns extracted using matrix indexing as seen in section 3.1.
A data frame can be created by importing a csv file, or with the data.frame() function.
Let’s define a variable containing the FVII levels (in some arbitrary units),
fvii <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
59, 46, 58, 43)
Then we can create a data frame with the factor variable containing the disease types and the FVII vector
containing the expression level of an individual with each disease type:
blood <- data.frame(Disease=diseasef, FVII=fvii)
# Use head() or tail() to display the first/last rows
head(blood)
## Disease FVII
## 1 Hemophilia A 60
## 2 Thrombocytopenia 49
## 3 Ehlers-Danlos syndrome 40
## 4 Hemophilia 61
## 5 Ehlers-Danlos syndrome 64
## 6 Thrombocytopenia 60
The head() function displays just the first six rows; see also the tail() function. Remember that the
columns of a data frame are retrieved through the $ operator.
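To illustrate column and row access, here is a small sketch on a three-row data frame. The name blood_demo and its values are made up for the example so that it runs on its own.

```r
# Small stand-in data frame (hypothetical values)
blood_demo <- data.frame(
  Disease = factor(c("Hemophilia A", "Thrombocytopenia", "Vasculitis")),
  FVII    = c(60, 49, 59)
)

blood_demo$FVII                      # one column, as a vector
blood_demo[2, "FVII"]                # matrix-style indexing: row 2, column FVII
blood_demo[blood_demo$FVII > 50, ]   # rows that satisfy a condition
mean(blood_demo$FVII)                # 56
```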
4. I/O
An important part of R deals with reading data from files. The easiest way to read a file is through the
read.table() function. This function reads a delimited text file, such as a .csv file, and generates a data frame in the R session. Other
functions in the same family are read.csv() and read.delim().
Be advised that reading very large data files with read.table() is resource-expensive and non-optimal. For
large files, lower-level primitives such as scan() should be used.
There are tools for interactively editing objects like data frames. This is useful for making small changes to
the data in the R session.
An object editor can be invoked with the edit() or fix() functions, e.g. blood <- edit(blood),
which is equivalent to executing fix(blood).
An example using read.table() is presented below
myDF <- read.table('iris.csv') # read file 'iris.csv' from the current working directory
str(myDF) # describe the structure of myDF
## $ V1: chr "sepal_length" "5.1" "4.9" "4.7" ...
## $ V2: chr "sepal_width" "3.5" "3.0" "3.2" ...
## $ V3: chr "petal_length" "1.4" "1.4" "1.3" ...
## $ V4: chr "petal_width" "0.2" "0.2" "0.2" ...
## $ V5: chr "class" "Iris-setosa" "Iris-setosa" "Iris-setosa" ...
myDF <- read.table('iris.csv', sep = ',', header = TRUE) # read the file, setting the field
# separator character to ',' and header to TRUE
str(myDF)
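Since the file iris.csv may not be available, the following sketch writes a few rows of the built-in iris dataset to a temporary file and reads them back. The temporary file and the names tmp, df1 and df2 are assumptions made so that the example is self-contained.

```r
# Write six rows of the built-in iris data to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(head(iris), tmp, row.names = FALSE)

# Read it back in two equivalent ways
df1 <- read.table(tmp, sep = ",", header = TRUE)
df2 <- read.csv(tmp)   # read.csv() already defaults to sep = "," and header = TRUE

str(df1)
identical(dim(df1), dim(df2))   # TRUE

unlink(tmp)   # remove the temporary file
```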
5. Control statements
5.1. Conditional
The language has available a conditional construction of the form
if ( 1==2 ) { print("yes") } else { print("no") }
## [1] "no"
The conditional expression must evaluate to a single logical value. The comparison operators are
>=, >, <, <=, == and !=. The logical operators && and || combine single logical values, whereas the & and | operators apply element-wise to vectors.
There is a vectorized version of the if/else construct, the ifelse() function. This has the form
ifelse(condition, a, b) and returns a vector of the same length as condition, with elements a[i] if
condition[i] is true, otherwise b[i].
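A short sketch contrasting the element-wise and the single-value forms; the vector v and the labels are illustrative.

```r
v <- c(-2, 0, 3, 7)

# Vectorized conditional: one result per element of the condition
sign_label <- ifelse(v > 0, "positive", "non-positive")

# Element-wise logical AND gives a logical vector
in_range <- v >= 0 & v < 5

# && combines single logical values, as needed inside if()
if (length(v) > 0 && all(is.finite(v))) {
  message("v is usable")
}
```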
5.2 Loops
Loops are quite similar to those in other programming languages. There is a for loop construction which has the
form
for ( ic in c("joan","helena","maria") ) { print(ic) }
## [1] "joan"
## [1] "helena"
## [1] "maria"
For loops are found much less often than in compiled languages, as R provides some compact forms for
iterating over objects, like apply(), sweep(), mapply(), tapply(), and others.
There are also other constructs, such as repeat and while:
k<-3; while ( k ) { print(k <- k-1) }
## [1] 2
## [1] 1
## [1] 0
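For comparison, the same countdown can be sketched with repeat, which loops until an explicit break. The vector steps is introduced here only to record the visited values.

```r
k <- 3
steps <- numeric(0)   # collects the values of k seen in the loop

repeat {
  k <- k - 1
  steps <- c(steps, k)
  if (k == 0) break   # repeat has no condition of its own
}

steps   # 2 1 0
```

Note that while (k) relies on a number being treated as TRUE when it is non-zero; writing while (k > 0) states the stopping rule more explicitly.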
6. Defining functions
As hinted before, R allows the user to create functions; this is a way to extend the functionality of R towards
our own interests. The definition of a simple function is shown below:
FunctionName <- function(x,y) { x+y }
FunctionName(3,4)
## [1] 7
Another example is the following function, that implements the following expression,
$$f(k) = \sum_{x=1}^{k-1} x$$
## [1] 3
SerialSum(5)
## [1] 10
Important!! Note that the return value of the function is the result of the last expression in the function.
Once a function is defined, it is easy to check its definition by typing just the name of the function, e.g.
SerialSum
## function(k)
## {
## out <- 0
## while (k)
## out <- out + ( k <- k-1 )
## out
## }
## <bytecode: 0x000002b6f3ab6af8>
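As a quick check, the function printed above can be compared with a vectorized equivalent. SerialSumVec is a hypothetical name used only here, and both versions assume that k is a positive integer.

```r
# Definition as printed above
SerialSum <- function(k) {
  out <- 0
  while (k)
    out <- out + (k <- k - 1)
  out   # value of the last expression is returned
}

# Vectorized alternative (hypothetical name): sum of 1, 2, ..., k - 1
SerialSumVec <- function(k) sum(seq_len(k - 1))

sapply(1:6, SerialSum)      # 0 1 3 6 10 15
sapply(1:6, SerialSumVec)   # same values
```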
There is also the possibility to define binary operators through the following syntax.
"%!%" <- function(X, y) { ... }
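As a concrete sketch of this syntax, the following defines a string-concatenation operator. The operator name %+% and its behavior are chosen for illustration and are not part of base R.

```r
# Hypothetical binary operator: paste two values together without a separator
"%+%" <- function(x, y) paste0(x, y)

"FVII" %+% "_level"   # "FVII_level"
```

Any name enclosed in percent signs can be used this way, and the operator is then called in infix position like the built-in %in% or %*%.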
6.1. apply() family functions and friends
In R there is a family of iterators known as the apply() family. This set of functions does most of the
work when an iterator is needed, avoiding in most cases the use of a for loop.
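A minimal sketch of the idea, comparing an explicit loop with sapply() and showing apply() over the rows and columns of a matrix; the data are illustrative.

```r
vals <- c(1, 4, 9, 16)

# Explicit for loop
res_loop <- numeric(length(vals))
for (i in seq_along(vals)) res_loop[i] <- sqrt(vals[i])

# Same computation with sapply()
res_sapply <- sapply(vals, sqrt)
identical(res_loop, res_sapply)   # TRUE

# apply() iterates over the rows (1) or columns (2) of a matrix
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)   # 9 12
col_sums <- apply(m, 2, sum)   # 3 7 11
```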
Once a function is defined, it is really easy to iterate it over a vector, such as:
myfun <- function(x) sqrt(x)*x
sapply(1:10, myfun)
## 1 2 3 4 5 6 7 8 9 10
## Min. 0.20 0.200 0.200 0.200 0.20 0.400 0.300 0.200 0.200 0.10
## 1st Qu. 1.10 1.100 1.025 1.175 1.10 1.375 1.125 1.175 1.100 1.15
## Median 2.45 2.200 2.250 2.300 2.50 2.800 2.400 2.450 2.150 2.30
## Mean 2.55 2.375 2.350 2.350 2.55 2.850 2.425 2.525 2.225 2.40
## 3rd Qu. 3.90 3.475 3.575 3.475 3.95 4.275 3.700 3.800 3.275 3.55
## Max. 5.10 4.900 4.700 4.600 5.00 5.400 4.600 5.000 4.400 4.90
• EX1. Please try to explain the function calls and the output generated.
A very useful function is split(), which retrieves the strata of a data frame given an
input factor. The output is a list with one stratum per element, named after the levels of the factor. An
example with the iris dataset is:
iris.strata <- split(iris,iris$Species)
length(iris.strata)
## [1] 3
names(iris.strata)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000 setosa : 0
## 1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200 versicolor:50
## Median :5.900 Median :2.800 Median :4.35 Median :1.300 virginica : 0
## Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
## 3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
## Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
The by() function extends the functionality of the split() function so that we can apply methods to each stratum:
by(iris[,1:2],iris$Species,summary)
## iris$Species: setosa
## Sepal.Length Sepal.Width
## Min. :4.300 Min. :2.300
## 1st Qu.:4.800 1st Qu.:3.200
## Median :5.000 Median :3.400
## Mean :5.006 Mean :3.428
## 3rd Qu.:5.200 3rd Qu.:3.675
## Max. :5.800 Max. :4.400
## ------------------------------------------------------------
## iris$Species: versicolor
## Sepal.Length Sepal.Width
## Min. :4.900 Min. :2.000
## 1st Qu.:5.600 1st Qu.:2.525
## Median :5.900 Median :2.800
## Mean :5.936 Mean :2.770
## 3rd Qu.:6.300 3rd Qu.:3.000
## Max. :7.000 Max. :3.400
## ------------------------------------------------------------
## iris$Species: virginica
## Sepal.Length Sepal.Width
## Min. :4.900 Min. :2.200
## 1st Qu.:6.225 1st Qu.:2.800
## Median :6.500 Median :3.000
## Mean :6.588 Mean :2.974
## 3rd Qu.:6.900 3rd Qu.:3.175
## Max. :7.900 Max. :3.800
Retrieve the fvii and diseasef variables defined in section 2.5. It is easy to calculate the mean concentration
in blood for each disease, e.g. using the function by() (please look at the help of the function).
by(fvii, diseasef, mean)
## diseasef: Hhemophilia A
## [1] 59
## ------------------------------------------------------------
## diseasef: Myeloproliferative
## [1] 58
## ------------------------------------------------------------
## diseasef: Thrombocytopenia
## [1] 53.22222
## ------------------------------------------------------------
## diseasef: Vasculitis
## [1] 54.83333
• Ex2: Compute the same values through the tapply() function.
• Ex3: Can you explain the function of tapply()?
A powerful function is the aggregate() function, which accepts the R formula interface for easy computations:
aggregate( . ~ Species, iris, mean)
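A short sketch of the formula interface on the built-in iris dataset; the names agg and agg_max are introduced for the example.

```r
# Mean of every numeric column, grouped by Species
agg <- aggregate(. ~ Species, data = iris, FUN = mean)
agg

# A single response variable and a different summary function
agg_max <- aggregate(Sepal.Length ~ Species, data = iris, FUN = max)
agg_max
```

The left-hand side of the formula names the variables to summarize (the dot stands for all remaining columns) and the right-hand side names the grouping variable.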
7. Graphical Output
The most frequently used plotting function is the plot() function. This is a generic function that will behave
differently depending on the type or mode of the object in the first argument.
Any graphical output will be diverted to the current graphics device. A graphics device can be just a window
on the X11 system or the Windows system, or a file such as a pdf. Please take a while to look at help(Devices).
Standard devices in a GNU/Linux system are X11, jpeg, png, pdf, pictex, xfig, bitmap and postscript.
If the argument has a numeric mode or type, like the fvii variable, the values are plotted against their index. So with this code,
plot(fvii,col=diseasef,pch=16)
[Figure 1: scatter plot of fvii (vertical axis, 40 to 70) against Index (horizontal axis, 0 to 30), with points colored by diseasef.]
we obtain figure 1.
Other standard plotting functions include hist(), dotchart(), image(), contour() and persp(), as well as
some lower-level primitives like points(), lines(), text(), abline(), polygon(), legend(), and
others. See also the help of the plot() and par() functions.
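The following sketch combines a high-level plot with two low-level primitives and sends the result to a pdf device. The data, the grouping factor grp and the temporary output file are assumptions made for the example.

```r
fvii_demo <- c(60, 49, 40, 61, 64, 60)            # a few illustrative values
grp <- factor(c("A", "B", "C", "A", "C", "B"))    # hypothetical grouping factor

out <- tempfile(fileext = ".pdf")
pdf(out)                                          # open a pdf graphics device

plot(fvii_demo, col = grp, pch = 16, xlab = "Index", ylab = "FVII")
abline(h = mean(fvii_demo), lty = 2)              # dashed line at the mean
legend("bottomright", legend = levels(grp),
       col = seq_along(levels(grp)), pch = 16)    # one entry per factor level

dev.off()                                         # close the device and write the file
file.exists(out)
```

Nothing is written to disk until dev.off() is called, so the device should always be closed after plotting to a file.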
By default, graphics produced with the built-in capabilities of R are not interactive; however, additional packages (see
section 8) can be installed and activated for interactive and dynamic graphics. One of these packages is the
GGobi package by Swayne, Cook and Buja, which can be found online at http://www.ggobi.org. These plotting
libraries can be accessed from R via a package named rggobi, described at http://www.ggobi.org/rggobi.
A nice and easy addition is the playwith package.
8. Packages
Packages are sets of functions, data and documentation for specific purposes. To check which packages are
installed at your site, issue the following command
library()
To know which packages are currently loaded, write
search()
# list the contents of the library
library(help = "mlbench")
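As a sketch of how the datasets shipped with a package can be listed programmatically, the following uses data(package = ...) on the pre-installed datasets package and only touches mlbench if it happens to be installed. The variable name ds is an assumption made for the example.

```r
# Datasets shipped with the pre-installed 'datasets' package
ds <- data(package = "datasets")$results   # character matrix, one row per dataset
head(ds[, c("Item", "Title")])

# 'datasets' is attached by default, so it shows up in the search path
"package:datasets" %in% search()

# Only use mlbench if it is installed on this machine
if (requireNamespace("mlbench", quietly = TRUE)) {
  head(data(package = "mlbench")$results[, "Item"])
} else {
  message("mlbench is not installed; install.packages(\"mlbench\") would fetch it")
}
```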
Search in both libraries: the pre-installed package datasets and the package mlbench. Choose a dataset and:
• Describe the size and type of the dataset, including the number of variables and observations.
• Try to describe the meaning of each variable and its type.
• Try to obtain some statistical description of the dataset.
• Practice reporting your results both numerically and graphically.
• Make sure you get ready for the upcoming questionnaire!