R Notes
R is a software package for statistics and graphics that is free in two ways: it is free to download and
its source code is freely available (see www.r-project.org). More technically, R is a language and
environment for statistical computing and graphics, distributed in source code form under the terms of
the Free Software Foundation's GNU General Public License (www.gnu.org). The current R is the result of a
collaborative effort with contributions from all over the world. R was initially written by Robert
Gentleman and Ross Ihaka, also known as "R & R", of the Statistics Department of the
University of Auckland. Since mid-1997 there has been a core group with write access to the R
source (see www.r-project.org/contributors.html). R is similar to the S language and environment,
which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John
Chambers and colleagues.
Why use R?
R has become the standard statistical software among statisticians. Consequently, new
statistical methods are often first available in R.
There is a great deal of built-in statistical functionality and many add-on packages
available that extend the basic functionality.
R creates fine statistical graphs with relatively little effort.
R is very well designed.
R software is of very high quality.
R is easy to use.
Installation of R
R can be downloaded from any of the many CRAN mirror sites listed on the R project
website (www.r-project.org) and then installed.
R console
When you first launch R, you will see the R console window.
When you use the R program it issues a prompt when it expects input commands.
The default prompt is ‘>’, which on UNIX might be the same as the shell prompt. In
a command line interface, you type commands that you want to execute and press return. For
example, if you type the line 2+2 and press the return key, R will give you the result
[1] 4.
In this mode, R can be used as a very simple calculator for addition, subtraction, multiplication,
and division using the standard operators +, -, *, and /. This ability to enter commands is the
fundamental building block for using the R program. In R, a variable is a name that is assigned a
particular value. The variable names are then used in place of numbers to complete calculations.
Values can be assigned to variables using one of three operators: <-, =, and ->. You can assign
the number 5 to the variable v1 with any of the following commands.
v1 <- 5
v1 = 5
5 -> v1
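As a brief sketch of how a variable then stands in for a number in later calculations (the name v2 is only illustrative and does not appear in the notes):

```r
v1 <- 5            # assign 5 to v1
v2 <- v1 * 2 + 1   # v1 is used in place of the number 5
v2                 # typing a name prints its value: [1] 11
```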
Exit R
To exit R either type q( ) in the commands window or select File > Exit (that is, Exit on the File
menu). You will be asked if you wish to save the ‘workspace image’: a ‘no’ answer is
appropriate unless you have created some objects that you wish to use again.
R is case sensitive - so be very careful in the use of upper and lower case.
/ - forward slash is used in all path names (as opposed to the backward slash ‘\’).
‘ and “ (single and double quotes) are used interchangeably as long as they are paired.
() refers to functions and contains the arguments of the corresponding function.
[] refers to indexing and references row and/or column elements of a data structure.
There are many built-in data files in R. You can type data( ) to see such data files. When you
type data( ), a new window called R data sets will appear, which includes the R data sets’ names
and a brief description for each data set. To look at the details of a specific data set, for example,
Titanic (Survival of passengers on the Titanic), you can type help(Titanic). A new window will
tell you the details of the data set you choose. Before conducting statistical analysis of a data
set, you have to load it into R first using, e.g., data(Titanic). If you then type the
data set name, e.g. Titanic, R will print the data set in the console window.
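The steps above might look as follows in a session. This is a minimal sketch using the Titanic data set that ships with R; the calls to class() and dimnames() are added here only to inspect the object.

```r
data(Titanic)             # load the built-in Titanic data set
class(Titanic)            # it is stored as a contingency table
dim(Titanic)              # four classifying factors
dimnames(Titanic)$Class   # levels of the first factor
```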
While R is an expansive language with a large number of routines already included, it doesn't
include everything, and has several specific areas of omission with respect to multivariate
analyses (e.g., no CCA). Fortunately, the core routines are easily augmented with additional
user-written routines which can be loaded into your copy of R. These routines are usually
provided in what R calls a 'package', which bundles the routine itself, help files, often
test data, and other items as necessary. Accordingly, it's necessary to know how to load packages
to make the most of R. Under Windows, click on the Packages menu and scroll down to the
Load package item. This will pop up a window listing all installed packages. To load a package,
simply click on it. Alternatively, call the library() function with the package name.
For example, enter:
library(MASS)
To see a list of installed packages, enter:
library()
If the package you want is not installed, you will have to install it
yourself. Again, depending on operating system and program, the details are somewhat different.
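From the console, one common way to install a missing package is install.packages(). The sketch below assumes an internet connection and a configured CRAN mirror if the package actually has to be installed; MASS is used only because it is the package named above.

```r
# Install MASS from CRAN only if it is not already available, then load it.
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS")
}
library(MASS)
```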
Arithmetic operators
The R language includes the usual arithmetic operators:
+ addition
- subtraction
* multiplication
/ division
^ or ** exponentiation
The left-pointing arrow (<-) is the assignment operator; it is composed of the two characters
< (less than) and - (dash or minus), with no intervening blanks, and is usually read as gets: "The
variable x gets the value c(1, 2, 3, 4)". The equals sign (=) may also be used for assignment in
place of the arrow (<-), except inside a function call, where = is exclusively used to specify
arguments by name. Because reserving the equals sign for specification of function arguments
leads to clearer and less error-prone R code, we encourage you to use the arrow for assignment,
even where = is allowed. When the leftmost operation in a command is an assignment, nothing is
printed. Typing the name of a variable on its own causes its value to be printed.
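A short sketch of assignment, printing and the arithmetic operators working together (the variable y is introduced here purely for illustration):

```r
x <- c(1, 2, 3, 4)     # assignment: nothing is printed
x                      # typing the name prints the value: [1] 1 2 3 4
y <- x^2 + 2 * x - 1   # arithmetic is applied element by element
y                      # [1]  2  7 14 23
```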
Logical operators
Operator    Description
<           less than
<=          less than or equal to
>           greater than
>=          greater than or equal to
==          exactly equal to
!=          not equal to
!x          not x
x | y       x OR y
x & y       x AND y
isTRUE(x)   test if x is TRUE
# An example
x <- c(1:10)
x[(x>8) | (x<5)]
# yields 1 2 3 4 9 10
# How it works
x <- c(1:10)
x
1 2 3 4 5 6 7 8 9 10
x > 8
F F F F F F F F T T
Like most programming languages, R allows users to create variables, which are essentially
named computer memory. For example, you may store the number of species in a sample in a
variable. Variables are identified by a name assigned when they are created. Names should be
unique, and long enough to clearly identify the contents of the variable. Variable names in R are
composed of letters (a–z, A–Z), numerals (0–9), periods (.), and underscores (_), and they may
be arbitrarily long. The first character must be a letter or a period, but variable names beginning
with a period are reserved by convention for special purposes. Names in R are case sensitive; so,
for example, x and X are distinct variables. They may not start with a number or an underscore, or
include the character "$" or any arithmetic symbols, as these have special meaning in R. Variables
are assigned a value in an assignment statement, which in R has the variable name to the left of a
left-pointing arrow (typed as "less than" followed by a "dash") and the value to the right of the
arrow. For example,
Age<-2
Notice that real or floating point numbers can be entered with just a decimal point, or in
exponential notation, where 1.0e-10 means .0000000001. Notice also that character variables,
called "strings" should be entered in quotes (single or double, it doesn't matter as long as they
match). Finally, note that the word TRUE is not surrounded by quotes. This is not the WORD
TRUE, but rather the VALUE TRUE. Logical variables can only take the values TRUE or
FALSE. Unlike many programming languages (e.g. FORTRAN or C) you do not have to tell R
what kind of value (integer, real, or character) a variable will contain; it can tell when the
variable is assigned. R will only allow the appropriate operations to be performed on a variable.
For example, name + 37 produces an error rather than adding 37 to name, because name is a
character variable.
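The value types described above can be checked with class(). In this sketch the variable names and values are hypothetical; tryCatch() is used only so that the expected error does not stop the script.

```r
age  <- 2L        # integer
len  <- 1.0e-10   # real number entered in exponential notation
name <- "robin"   # character string, entered in quotes
seen <- TRUE      # the logical value TRUE, not the word "TRUE"
class(age); class(len); class(name); class(seen)

# Adding a number to a character variable is not allowed.
result <- tryCatch(name + 37, error = function(e) "error")
result            # "error"
```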
Data Structures
R is a 4th generation language, meaning that it includes high-level routines for working with data
structures, rather than requiring extensive programming by the analyst. In R there are 4 primary
data structures we will use repeatedly.
1. Vectors --- vectors are one-dimensional ordered sets composed of a single data type. Data
types include integers, real numbers, and strings (character variables).
2. Matrices --- matrices are two dimensional ordered sets composed of a single data type,
equivalent to the concept of matrix in linear algebra.
3. Data frames --- data frames are two-dimensional sets (rows and columns), and can be composed of
different data types (although all data in a single column must be of the same type). In
addition, each column and row in a data frame may be given a label or name to identify
it. Data frames are equivalent to a flat file database, and similar to spreadsheets.
Accordingly, we often refer to specific columns in a data frame as "fields."
4. Lists --- lists are compound objects of associated data. Like data frames, they need not
contain only a single data type, but can include strings (character variables), numeric
variables, and even such things as matrices and data frames. In contrast to data frames,
list items do not have a row-column structure, and items need not be the same length;
some can be single values, and others a matrix.
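The four structures can be sketched side by side as follows. All object names and values here are made up for illustration.

```r
v  <- c(2.5, 4, 7)                               # vector: a single data type
m  <- matrix(1:6, nrow = 2, ncol = 3)            # matrix: 2 rows, 3 columns
df <- data.frame(plot = c("a", "b", "c"),        # data frame: columns may differ in type
                 count = c(3, 0, 12))
l  <- list(label = "demo", values = v, table = m)  # list: items of any type and length

length(v); dim(m); names(df); names(l)
df$count        # a column ("field") referred to by name
l$table[2, 3]   # an element of the matrix stored in the list: 6
```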
the length of the longest vector. In particular a constant is simply repeated. So with
the above assignments the command;
> v <- 2*x + y + 1
generates a new vector v of length 11 constructed by adding together, element by element,
2*x repeated 2.2 times, y repeated just once, and 1 repeated 11 times. The elementary
arithmetic operators are the usual +, -, *, / and ^ for raising to a power. In addition
all of the common arithmetic functions are available. log, exp, sin, cos, tan, sqrt and
so on, all have their usual meaning. max and min select the largest and smallest
elements of a vector respectively. range is a function whose value is a vector of
length two, namely c(min(x), max(x)). length(x) is the number of elements in x, sum(x)
gives the total of the elements in x, and prod(x) their product. Two statistical functions are
mean(x) which calculates the sample mean, which is the same as sum(x)/length(x), and var(x)
which gives sum((x-mean(x))^2)/ (length(x)-1) or sample variance. If the argument to var() is an
n-by-p matrix the value is a p-by-p sample covariance matrix got by regarding the rows as
independent p-variate sample vectors. sort(x) returns a vector of the same size as x with
the elements arranged in increasing order; however there are other more flexible
sorting facilities available.
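The recycling rule and the summary functions above can be tried on a small example. The values of x and y below are assumptions, chosen only so that x has length 5 and y has length 11 as in the text; suppressWarnings() hides the warning R gives when the longer length is not a multiple of the shorter one.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)   # hypothetical vector of length 5
y <- c(x, 0, x)                      # length 11
v <- suppressWarnings(2 * x + y + 1) # 2*x is recycled to length 11
length(v)                            # 11

range(x)                                       # c(min(x), max(x))
m  <- sum(x) / length(x)                       # same as mean(x)
s2 <- sum((x - mean(x))^2) / (length(x) - 1)   # same as var(x)
all.equal(m, mean(x)); all.equal(s2, var(x))
sort(x)                                        # increasing order
```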
(b) Matrix
A matrix can be created by simply binding together two or more vectors of the same type and
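length. For example, if we create a second vector, demo.vector2, of the same length as demo.vector1,
we can then bind the two vectors together using the cbind() function to create a matrix:
demo.matrix<-cbind(demo.vector1, demo.vector2)
A matrix can also be created directly from its values with the matrix() function:
demo.matrix<-matrix(c(1,4,2,6,12,4,2,1,2,4),byrow=F,nrow=5,ncol=2)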
Matrices are specified in the order "row, column", so that demo.matrix[4,2] represents the
element at row 4 and column 2 in matrix demo.matrix. Individual rows or columns within a
matrix can be referred to by implied subscript, where the value of the desired row or column is
specified, but other values are omitted. For example, demo.matrix[,2] represents the second
column of matrix demo.matrix, as the row number before the comma was omitted. Similarly,
demo.matrix[5,] represents row 5 of matrix demo.matrix, as the column after the comma was
omitted.
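Using the matrix() call given above, the three forms of subscript can be checked directly:

```r
demo.matrix <- matrix(c(1,4,2,6,12,4,2,1,2,4), byrow = FALSE, nrow = 5, ncol = 2)
demo.matrix[4, 2]   # element at row 4, column 2: 2
demo.matrix[, 2]    # second column: 4 2 1 2 4
demo.matrix[5, ]    # fifth row: 12 4
```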
Matrix multiplication
Two matrices A and B can be multiplied using A%*%B. If instead we want the 'term by term'
product, that is, the product of the corresponding elements of A and B, we can use A*B.
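The difference between the two products is easiest to see on a small case. The matrices A and B below are hypothetical.

```r
A <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
B <- matrix(5:8, nrow = 2)   # columns are (5, 6) and (7, 8)
A %*% B                      # matrix product: rows (23, 31) and (34, 46)
A * B                        # element-wise product: rows (5, 21) and (12, 32)
```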
matrix of corresponding eigenvectors. Had we only needed the eigenvalues we could have used
the assignment:
> evals <- eigen(Sm)$values
evals now holds the vector of eigenvalues and the second component is discarded. If the
expression
> eigen(Sm)
is used by itself as a command the two components are printed, with their names.
Singular value decomposition and determinants
The function svd(M) takes an arbitrary matrix argument, M, and calculates the singular value
decomposition of M. This consists of a matrix of orthonormal columns U with the same column
space as M, a second matrix of orthonormal columns V whose column space is the row space
of M and a diagonal matrix of positive entries D such that M = U %*% D %*% t(V). D is
actually returned as a vector of the diagonal elements. The result of svd(M) is actually a list of
three components named d, u and v, with evident meanings.
If M is in fact square, then, it is not hard to see that
> absdetM <- prod(svd(M)$d)
calculates the absolute value of the determinant of M. If this calculation were needed often with
a variety of matrices it could be defined as an R function
> absdet <- function(M) prod(svd(M)$d)
after which we could use absdet() as just another R function.
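As a quick check of the statements above, the sketch below rebuilds a matrix from its decomposition and compares absdet() with the built-in det(). The matrix M is a made-up 2-by-2 example.

```r
absdet <- function(M) prod(svd(M)$d)
M <- matrix(c(2, 1, 1, 3), nrow = 2)   # hypothetical matrix with determinant 5
s <- svd(M)
# diag(s$d) turns the returned vector d back into the diagonal matrix D
all.equal(M, s$u %*% diag(s$d) %*% t(s$v))
all.equal(absdet(M), abs(det(M)))
```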
Forming partitioned matrices, cbind() and rbind()
Matrices can be built up from other vectors and matrices by the functions cbind() and rbind().
Roughly cbind() forms matrices by binding together matrices horizontally, or column-wise, and
rbind() vertically, or row-wise. In the assignment
> X <- cbind(arg_1, arg_2, arg_3, ...)
the arguments to cbind() must be either vectors of any length, or matrices with the same column
size, that is the same number of rows. The result is a matrix with the concatenated arguments
arg_1, arg_2, ... forming the columns. If some of the arguments to cbind() are vectors they may be
shorter than the column size of any matrices present, in which case they are cyclically extended
to match the matrix column size (or the length of the longest vector if no matrices are given).
The function rbind() does the corresponding operation for rows. In this case any vector argument,
possibly cyclically extended, is of course taken as a row vector. Suppose X1 and X2 have the same
number of rows. To
combine these by columns into a matrix X, together with an initial column of 1s we can use
> X <- cbind(1, X1, X2)
The result of rbind() or cbind() always has matrix status.
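A minimal sketch of both functions, with hypothetical X1 and X2 (here X2 is a plain vector, which cbind() treats as a column):

```r
X1 <- matrix(1:6, nrow = 3)   # 3 x 2 matrix
X2 <- c(10, 20, 30)           # vector of length 3
X  <- cbind(1, X1, X2)        # the constant 1 is recycled down the first column
dim(X)                        # 3 4
R  <- rbind(X1, c(0, 0))      # append a row of zeros
dim(R)                        # 4 2
```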
Editing data
When invoked on a data frame or matrix, edit brings up a separate spreadsheet-like environment
for editing. This is useful for making small changes once a data set has been read. The command
> xnew <- edit(xold)
will allow you to edit your data set xold, and on completion the changed object is assigned
to xnew. If you want to alter the original dataset xold, the simplest way is to use fix(xold),
which is equivalent to xold <- edit(xold).
Use
> xnew <- edit(data.frame())
to enter new data via the spreadsheet interface.
1: 3 5 6
4: 3 5 78 29
8: 34 5 1 78
12:
Read 11 items
Suppose the data vectors are of equal length and are to be read in parallel. Further suppose
that there are three vectors, the first of mode character and the remaining two of mode numeric,
and the file is 'input.dat'. The first step is to use scan() to read in the three vectors as a list,
as follows
> inp <- scan("input.dat", list("",0,0))
The scan function is an extremely flexible tool for importing data. Unlike the read.table
function, however, which returns a data frame, the scan function returns a list or a vector. For the
what option, we use list and then list the variables, and after each variable, we tell R what type
of variable (e.g., numeric, string) it is. In the first example, the first variable is age, and we tell R
that age is a numeric variable by setting it equal to 0. The second variable is called name, and it
is denoted as a string variable by the empty quote marks.
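The call above can be tried without an existing input.dat by writing a small file first. In this sketch the file contents and the component names name, age and height are made up for illustration.

```r
# Write a small hypothetical input file to a temporary location
tf <- tempfile(fileext = ".dat")
writeLines(c("ann 31 1.62", "bob 45 1.80", "cat 27 1.71"), tf)

# One character field followed by two numeric fields, read in parallel
inp <- scan(tf, what = list("", 0, 0), quiet = TRUE)
inp[[1]]   # "ann" "bob" "cat"
inp[[2]]   # 31 45 27

# Naming the components of the what list names the resulting vectors
inp2 <- scan(tf, what = list(name = "", age = 0, height = 0), quiet = TRUE)
inp2$age
unlink(tf)   # remove the temporary file
```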
> dim(Z) <- c(3,4,2)
However if h is shorter than 24, its values are recycled from the beginning again to make
it up to size 24. As an extreme but common example
> Z <- array(0, c(3,4,2))
makes Z an array of all zeros.
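The sketch below illustrates the dim() assignment and the recycling behaviour on a 3 by 4 by 2 array. The vector h and the names Z2 and Z0 are assumptions introduced for the example.

```r
h <- 1:24
Z <- h
dim(Z) <- c(3, 4, 2)           # reshape: 3 rows, 4 columns, 2 layers
Z[2, 3, 1]                     # 8: the first subscript moves fastest
Z2 <- array(1:6, c(3, 4, 2))   # a shorter data vector is recycled to fill 24 cells
Z2[1, 3, 1]                    # 1: the values start again after the sixth cell
Z0 <- array(0, c(3, 4, 2))     # the constant 0 is recycled everywhere
sum(Z0)                        # 0
```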
(g) Data Frames
One of the most challenging tasks in data analysis is data preparation. R provides various
structures for holding data and many methods for importing data from both keyboard and
external sources. One of those structures is data frames. Data frames are the primary data
structure in R. A data frame is used for storing data tables. It is a list of vectors of equal length. A
data.frame object in R has similar dimensional properties to a matrix but it may contain
categorical data, as well as numeric. The standard is to put data for one sample across a row and
covariates as columns. On one level, as the notation will reflect, a data frame is a list. Each
component corresponds to a variable; i.e., the vector of values of a given variable for each
sample. A data frame is like a list with components as columns of a table.
Usage
data.frame(..., row.names = NULL, check.rows = FALSE,
check.names = TRUE,
stringsAsFactors = default.stringsAsFactors())
default.stringsAsFactors()
Arguments
...              these arguments are of either the form value or tag = value. Component
                 names are created based on the tag (if present) or the deparsed argument
                 itself.
row.names        NULL or a single integer or character string specifying a column to be used as
                 row names, or a character or integer vector giving the row names for the data
                 frame.
check.rows       if TRUE then the rows are checked for consistency of length and names.
check.names      logical. If TRUE then the names of the variables in the data frame are checked
                 to ensure that they are syntactically valid variable names and are not
                 duplicated. If necessary they are adjusted so that they are.
stringsAsFactors logical: should character vectors be converted to factors? In R versions
                 before 4.0.0 the 'factory-fresh' default is TRUE, which can be changed by setting
                 options(stringsAsFactors = FALSE). From R 4.0.0 onward the default is FALSE.
For example
> x=18:29
> y=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
We will now use R's data.frame command to create our first dataframe and store the results in
the variable village.
> village=data.frame(age=x,height=y)
> village
age height
1 18 76.1
2 19 77.0
3 20 78.1
4 21 78.2
5 22 78.8
…………….
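Once created, the data frame can be inspected and subset by column name. The following sketch repeats the village example so that it runs on its own; the particular checks shown are illustrative.

```r
x <- 18:29
y <- c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
village <- data.frame(age = x, height = y)
dim(village)                  # 12 rows, 2 columns
village$height[1:3]           # a column by name: 76.1 77.0 78.1
village[village$age > 27, ]   # the rows where age exceeds 27
str(village)                  # compact summary of the column types
```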
Data frames can be accessed exactly as can matrices, but can also be accessed by data frame and
column or field name, without knowing the column number for a specific data item. For
illustration, let’s load the testbird.csv dataset by typing the following (we will come back to
importing and exporting data later):
testbird<-read.csv("D:/testbird.csv", header=TRUE)
where you will have to put in the correct local path to the folder containing the testbird dataset.
Alternatively, file.choose() opens a dialog for selecting the file interactively:
testbird<-read.csv(file.choose(), header=TRUE)
read.csv() returns a data frame, which can hold the mixed variable types in this incoming data set
(both numeric and character fields).
Character vectors
Character quantities and character vectors are used frequently in R, for example as plot labels.
Where needed they are denoted by a sequence of characters delimited by the double quote
character, e.g., "x-values", "New iteration results". Character strings are entered using either
matching double (") or single (') quotes, but are printed using double quotes (or sometimes
without quotes).
Index vectors; selecting and modifying subsets of a data set
1. A logical vector. In this case the index vector must be of the same length as the
vector from which elements are to be selected. Values corresponding to TRUE in the
index vector are selected and those corresponding to FALSE are omitted. For example
> y <- x[!is.na(x)]
creates (or re-creates) an object y which will contain the non-missing values of x, in
the same order. Note that if x has missing values, y will be shorter than x. Also
> (x+1)[(!is.na(x)) & x>0] -> z
creates an object z and places in it the values of the vector x+1 for which the
corresponding value in x was both non-missing and positive.
2. A vector of positive integral quantities. The corresponding elements of the vector are
selected, in the order given. For example
> x[1:10]
selects the first 10 elements of x (assuming length(x) is not less than 10). Also
> c("x","y")[rep(c(1,2,2,1), times=4)]
(an admittedly unlikely thing to do) produces a character vector of length 16
consisting of "x", "y", "y", "x" repeated four times.
3. A vector of negative integral quantities. Such an index vector specifies the values
to be excluded rather than included. Thus
> y <- x[-(1:5)]
gives y all but the first five elements of x.
4. A vector of character strings. This possibility only applies where an object has a
names attribute to identify its components. In this case a sub-vector of the names vector
may be used in the same way as the positive integral labels in item 2 further above.
> fruit <- c(5, 10, 1, 20)
> names(fruit) <- c("orange", "banana", "apple", "peach")
> lunch <- fruit[c("apple","orange")]
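An indexed expression can also appear on the left-hand side of an assignment, in which case
only the indexed elements are changed. For example,
> x[is.na(x)] <- 0
replaces any missing values in x by zeros and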
> y[y < 0] <- -y[y < 0]
has the same effect as
> y <- abs(y)
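The four kinds of index vector, and indexed assignment, can be tried on one small vector. The values of x below are hypothetical.

```r
x <- c(3, NA, -1, 7, NA, 2)         # hypothetical vector with missing values
y <- x[!is.na(x)]                   # logical index: 3 -1 7 2
z <- (x + 1)[(!is.na(x)) & x > 0]   # 4 8 3
x[1:3]                              # positive indices: 3 NA -1
x[-(1:5)]                           # negative indices drop elements: 2

fruit <- c(5, 10, 1, 20)
names(fruit) <- c("orange", "banana", "apple", "peach")
lunch <- fruit[c("apple", "orange")]   # character index: apple = 1, orange = 5

x[is.na(x)] <- 0                    # only the indexed elements are replaced
x                                   # 3 0 -1 7 0 2
```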
For example, testbird is a bird abundance data frame containing 32 sample plots (rows) and 10
fields (columns). The first 3 fields (columns) contain plot identifiers. The first two are character
fields (BASIN and SUB) and the third field is numeric (BLOCK). The remaining 7 fields
(columns) are numeric and contain abundances for 7 different bird species. Given this data
structure, we can perform the following:
x<-max(testbird[,5])
assigns to x the maximum abundance of the second species (fifth column) among all plots.
Alternatively, because testbird is a data frame, we can accomplish the same thing with the
column name:
x<-max(testbird$AMRO)
y<-sum(testbird[,5])
assigns the sum of the second listed species' abundance over all plots to y.
x<-log(testbird[,4:10]+1)
creates a new data frame called 'x' whose values are the logs of the respective values in
columns 4 through 10 of testbird (+1 to avoid log(0), which R evaluates to -Inf).
In addition, R supports logical subscripts, where the subscript selects the elements for which
the logical expression is true. For example,
x<-sum(testbird[,5]>1)
assigns the number of plots where the abundance of the species in column 5 is greater than 1
(testbird[,5]>1 is evaluated as 1 (true) or 0 (false), so that the sum is of 0's and 1's).
x<-sum(testbird[,5][testbird[,5]>1])
assigns the sum of the abundance for the species in column 5 in plots where the species in
column 5 has abundance greater than 1.
x<-max(testbird[,5][testbird$BHGR==5])
assigns the maximum abundance for the species in column 5 for plots with the abundance of
BHGR equal to 5.
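Since testbird.csv is not reproduced in these notes, the same operations can be tried on a small stand-in. The data frame birds below is entirely hypothetical; only the column names AMRO and BHGR are borrowed from the text.

```r
birds <- data.frame(BASIN = c("D", "D", "L", "L"),
                    AMRO  = c(0, 2, 5, 1),
                    BHGR  = c(5, 0, 5, 3))
max(birds[, 2])                    # by column number: 5
max(birds$AMRO)                    # by column name: 5
sum(birds$AMRO > 1)                # number of plots with AMRO > 1: 2
sum(birds$AMRO[birds$AMRO > 1])    # total AMRO in those plots: 7
max(birds$AMRO[birds$BHGR == 5])   # largest AMRO where BHGR equals 5: 5
log(birds[, 2:3] + 1)              # element-wise log(x + 1) of the numeric columns
```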
Missing Values
Missing values deserve special mention. Missing values in a vector or matrix are always a problem
in data sets. Sometimes it is best simply to remove samples with missing data, but often only one
or a few values are missing, and it's best to keep the sample in the matrix with a suitable missing
value code. Let’s assume that we have missing values in a vector. First, select the fourth column
from the testbird data frame, which contains a single missing value:
x<-testbird[,4]
To use all of the vector EXCEPT the missing value, use:
y<-x[!is.na(x)]
The R function to identify a missing value is:
is.na( )
so that to say all of a vector except missing values, we set a logical test to be true when values
are not missing. Since the R operator for ‘not’ is !, the correct test is:
!is.na( )
and to specify which vector we're testing for missing value, we put the vector in parentheses as
follows:
!is.na(x)
Accordingly, the full expression is
x[!is.na(x)]
Handling missing values in this way is critical in R because element-by-element operations on
vectors or matrices expect operands with the same number of elements. So, if there are missing
values in any field we're using in a calculation, the same record (row) must be omitted from all
the other fields as well.
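Dropping the same records from every field can be sketched as follows. The data frame d is hypothetical, and na.omit() is shown as a built-in shortcut that gives the same rows.

```r
d <- data.frame(a = c(1, NA, 3, 4), b = c(10, 20, NA, 40))   # hypothetical data
keep <- !is.na(d$a) & !is.na(d$b)   # TRUE only for rows with no missing value
d[keep, ]                           # rows 1 and 4
na.omit(d)                          # shortcut giving the same rows
cor(d$a[keep], d$b[keep])           # both vectors now have the same length
```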
Functions in R
A function consists of a name and one or more parameters (or arguments) contained in
parentheses that are required to process the function. A simple function that we have already
used is the sum() function, which returns the sum of all the values present in its arguments. In its
simplest form, sum() takes two arguments:
sum(x, na.rm=FALSE)
The first argument, x, is the data set (either a vector, matrix, or data frame containing all numeric
variables) you wish to sum, and the second argument indicates whether missing values should be
ignored. The default na.rm=FALSE will return NA if there are any missing values, whereas
na.rm=TRUE will ignore the missing values when calculating the sum.
In the case of the testbird data set, applying the sum function to the species abundance fields
(columns 4-10) with the default argument of na.rm=FALSE, returns the following:
sum(testbird[,4:10])
Note, there is no need to include the arguments if you wish to use the defaults provided.
Applying the sum function with the missing values argument set to TRUE, returns the following:
sum(testbird[,4:10],na.rm=TRUE)
Note, in this case the sum is across all elements in all rows and in columns 4 through 10. The
apply() function we used above is a special function that allows us to apply other functions to
each column or row of the matrix. In this case, we applied the sum() function to each species
column of the testbird data set and returned a vector of values containing the sum of abundance
for each species.
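The effect of na.rm, and of apply() with a margin of 2 (columns) or 1 (rows), can be seen on a small matrix. The matrix counts is a made-up stand-in for the species columns of testbird.

```r
counts <- matrix(c(1, NA, 3,
                   4, 5, 6), nrow = 3)   # hypothetical: 3 plots (rows) x 2 species (columns)
sum(counts)                              # NA, because one value is missing
sum(counts, na.rm = TRUE)                # 19
apply(counts, 2, sum, na.rm = TRUE)      # column sums: 4 15
apply(counts, 1, sum, na.rm = TRUE)      # row sums: 5 5 9
```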
When using functions it is important to understand the arguments of the function. The arguments
of a function are all defined in the associated help file. Each function has one or more named
arguments. Some or all of the arguments may come with default values, in which case you do not
need to specify any arguments inside the () when calling the function. However, in most cases
one or more of the arguments will not have a default value and thus you must provide a value for
the argument. For example, the sum() function requires that you specify a data set (an object,
either a vector, matrix, or data frame containing all numeric variables). If you do not specify a
value for this argument, you will get an error message.
In addition, if you specify values for arguments in the order that they are given in the written
function, then the arguments do not need to be named explicitly in the function call. For
example, in the apply() function, the following two calls are equivalent:
apply(X=testbird[,4:10],MARGIN=2,FUN=sum)
apply(testbird[,4:10],2,sum)
This is because in the second call the arguments are given in the same order as expected. If,
however, you want to specify the arguments in a different order from the default, then the
argument names must be included in the function call, e.g.:
apply(FUN=sum,X=testbird[,4:10],MARGIN=2)
In practice, explicitly naming the arguments often is required when you want to only specify say
the first and fourth argument of the function and accept the default values for the second and
third. In this case, you do not need to name the first argument if given first in your call, but you
must name the fourth argument.
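The built-in seq() function gives a compact illustration of these rules, since its leading arguments are from, to, by and length.out. The particular values are arbitrary.

```r
seq(1, 10, 3)                    # positional: from = 1, to = 10, by = 3 -> 1 4 7 10
seq(by = 3, to = 10, from = 1)   # the same call with named arguments in another order
seq(1, length.out = 4)           # first argument by position, fourth by name -> 1 2 3 4
```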
Functions are essential to working with R. You will be using functions constantly to manipulate,
summarize, analyze, and graphically display your data. For most of the things you will need to
do in this course, functions have already been written by others and you will simply need to
know how to call these functions and interpret their output. However, you can’t work long in R
without confronting the need to construct your own functions. In most cases, these will be
functions that call or make use of existing R functions, but in particular ways suited to your
applications. Throughout this course, we will make extensive use of existing R functions to
complete projects, but there may be a need or opportunity for you to create your own functions.
Any time you issue a set of commands that you anticipate having to repeat or reuse in the future,
you should consider writing a function. Although we will not go into the details of writing
functions here, you can easily review the code for a function by simply typing the function name
at the console.
seq(1,10,0.4) # Generate a sequence (1 -> 10, spaced by 0.4)
sequence() # Create a vector of sequences
sign(x) # Returns the signs of the elements of x
sort(x) # Sort the vector x
order(x) # list sorted element numbers of x
tolower(),toupper() # Convert string to lower/upper case letters
unique(x) # Remove duplicate entries from vector
system("cmd") # Execute "cmd" in operating system (outside of R)
vector() # Produces a vector of given length and mode
floor(x), ceiling(x), round(x), signif(x), trunc(x) # rounding functions
Sys.time() # Return system time
Sys.Date() # Return system date
getwd() # Return working directory
setwd() # Set working directory
list.files() # List files in a given directory
file.info() # Get information about files
log(x),logb(),log10(),log2(),exp(),expm1(),log1p(),sqrt() # Fairly obvious
cos(),sin(),tan(),acos(),asin(),atan(),atan2() # Usual stuff
cosh(),sinh(),tanh(),acosh(),asinh(),atanh() # Hyperbolic functions
union(),intersect(),setdiff(),setequal() # Set operations
+,-,*,/,^,%%,%/% # Arithmetic operators
<,>,<=,>=,==,!= # Comparison operators
eigen() # Computes eigenvalues and eigenvectors
sqrt(),sum()
function_name: the function's name. This can be any valid variable name, but you should
avoid using names that are used elsewhere in R, such as dir, function, plot, etc.
arg1, arg2, arg3: these are the arguments of the function, also called formals. You can write a
function with any number of arguments. These can be any R object: numbers, strings, arrays,
data frames, or even other functions; anything that is needed for the function_name
function to run.
Some arguments have default values specified, such as arg3 in our example. Arguments without
a default must have a value supplied for the function to run. You do not need to provide a value
for those arguments with a default, as the function will use the default value.
Function body: The function code within the {} brackets is run every time the
function is called. This code might be very long or very short. Ideally functions are short and do
just one thing – problems are rarely too small to benefit from some abstraction.
Return value: The value of the last expression evaluated in the code is returned by the function. It is not
necessary that a function return anything, for example a function that makes a plot might not
return anything, whereas a function that does a mathematical operation might return a number, or
a list.
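The parts described above fit together as in the following sketch. The names function_name, arg1, arg2 and arg3 are placeholders taken from the description, and the body is invented purely for illustration.

```r
# function_name, arg1, arg2 and arg3 are placeholder names
function_name <- function(arg1, arg2, arg3 = 2) {
  total <- arg1 + arg2   # function body
  total * arg3           # the last evaluated expression is the return value
}
function_name(1, 2)              # arg3 takes its default value: 6
function_name(1, 2, arg3 = 10)   # default overridden: 30
formals(function_name)           # lists the arguments and their defaults
```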
Examples:
(1)
f1 <- function(x, y) {
x+y
}
f1( 3, 4)
(2)
Another example: consider a function to calculate the two sample t-statistic, showing "all the
steps". This is an artificial example, of course, since there are other, simpler ways of achieving
the same end.
The function is defined as follows:
twosamplet <- function(y1, y2) {
  n1 <- length(y1); n2 <- length(y2)
  yb1 <- mean(y1); yb2 <- mean(y2)
  s1 <- var(y1); s2 <- var(y2)
  s <- ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2)
  tst <- (yb1 - yb2)/sqrt(s*(1/n1 + 1/n2))
  tst
}
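One of the "simpler ways" is the built-in t.test() function. The sketch below repeats the definition so that it runs on its own and compares the two results; the samples y1 and y2 are hypothetical.

```r
twosamplet <- function(y1, y2) {
  n1 <- length(y1); n2 <- length(y2)
  yb1 <- mean(y1); yb2 <- mean(y2)
  s1 <- var(y1); s2 <- var(y2)
  s <- ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2)   # pooled variance
  (yb1 - yb2)/sqrt(s*(1/n1 + 1/n2))
}
y1 <- c(5.1, 4.9, 6.2, 5.8, 6.0)   # hypothetical samples
y2 <- c(4.2, 4.8, 5.0, 4.4)
tst <- twosamplet(y1, y2)
# t.test() with var.equal = TRUE uses the same pooled-variance statistic
ref <- unname(t.test(y1, y2, var.equal = TRUE)$statistic)
all.equal(tst, ref)
```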
(3)
f.good <- function(x, y) {
z1 <- 2*x + y
z2 <- x + 2*y
z3 <- 2*x + 2*y
z4 <- x/y
return(c(z1, z2, z3, z4))
}
f.good(1, 2)
(4)
intsum <- function(from=1, to=10)
{
sum <- 0
for (i in from:to)
sum <- sum + i
sum
}
intsum(3) # Evaluates sum from 3 to 10 …
intsum(to = 3) # Evaluates sum from 1 to 3 …
(5) Newton Raphson method
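f<-function(x)
{return(x^3-x-1)}
# f1 is the derivative of f
f1<-function(x)
{return(3*x^2-1)}
# Repeat the Newton step x <- x + h until the step size h is below tol
NR<-function(x, tol=0.0001, maxit=100){
  i<-1
  h<- -1*f(x)/f1(x)
  while(abs(h)>=tol && i<=maxit){
    x<-x+h
    i<-i+1
    h<- -1*f(x)/f1(x)
  }
  return(x)
}
NR(1.5)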
Control statements
Control structures commonly used in R include:
if, else: testing a condition
for: execute a loop for a fixed number of times
while: execute a loop while a condition is true
repeat: execute a loop until seeing a break
break: break the execution of a loop
next: skip an iteration of a loop
return: exit a function
(1) if … else …
if (expr_1) {
  expr_2
  …
} else {
  expr_3
  …
}
The first expression should return a single logical value
Example: 1
a <- -2
if (a < 0) {
  cat(a, "is a negative number")
} else {
  cat(a, "is a non-negative number")
}
Example: 2
n <- 32
if (n %% 2 != 0) {
  print("n is not even")
} else {
  print("n is even")
}
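Because if() expects a single logical value, a condition on a whole vector needs either a loop or the vectorized ifelse() function from base R. The following is a small sketch; the vector v and the name parity are made up for illustration.

```r
# ifelse() tests every element of a vector and returns a vector of results
v <- c(-2, 0, 5, 7)                              # illustrative data
parity <- ifelse(v %% 2 != 0, "odd", "even")
print(parity)
```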
(2) for
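for (name in expr_1)
  expr_2
Here name is the loop variable, expr_1 is a vector expression (often a sequence such as 1:20), and
expr_2 is evaluated once for each value that name takes.
Example:
for (i in 1:5)
  print(i)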
z <- rep(0, 1000)
for (i in 1:1000) {
  coin <- rbinom(1, 1, 0.5)   # one fair coin toss: 1 or 0
  if (coin == 1) {
    z[i] <- z[i] + 1
  } else {
    z[i] <- z[i] - 1
  }
}
plot(z, type = "b")
(3) repeat
# Sample with replacement from the integers 1 to N until the number 15 has been sampled twice
M <- 0 # M is the number of samplings required to reach the criterion in this run
matches <- 0
N <- 100 # values are drawn from the integers 1 to 100
repeat
{
  # Keep track of the total number of values sampled
  M <- M + 1
  # Sample a new value
  p <- sample(N, 1) # one random integer between 1 and N
  # Increment matches whenever we sample 15
  if (p == 15)
    matches <- matches + 1
  # Stop after 2 matches
  if (matches == 2)
    break
}
M
(4) while
while (expr_1)
  expr_2
Here expr_2 is evaluated repeatedly for as long as expr_1 is true. break and next statements can
be used within the loop.
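Example: Random walk
z <- 5
path <- z                      # record every position so the walk can be plotted
while (z >= 3 && z <= 10) {
  print(z)
  coin <- rbinom(1, 1, 0.5)
  if (coin == 1) {
    z <- z + 1
  } else {
    z <- z - 1
  }
  path <- c(path, z)
}
plot(path, type = "b")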
Random Generation
runif(n, min = 0, max = 1) • Samples from Uniform distribution
rbinom(n, size, prob) • Samples from Binomial distribution
rnorm(n, mean = 0, sd = 1) • Samples from Normal distribution
rexp(n, rate = 1) • Samples from Exponential distribution
rt(n, df) • Samples from Student's t distribution
And others!
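The generators listed above can be tried as follows. This is only a sketch: the seed, the sample size of 5 and the parameter values are arbitrary choices for illustration.

```r
set.seed(1)                            # fix the seed so the draws are reproducible
u  <- runif(5)                         # 5 draws from Uniform(0, 1)
b  <- rbinom(5, size = 10, prob = 0.3) # 5 draws from Binomial(10, 0.3)
z  <- rnorm(5, mean = 0, sd = 1)       # 5 draws from N(0, 1)
e  <- rexp(5, rate = 1)                # 5 draws from Exponential(1)
tt <- rt(5, df = 4)                    # 5 draws from Student's t with 4 df
print(u)
print(b)
```

Calling set.seed() with the same value before the draws reproduces the same numbers.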
Plotting in R
R has a powerful graphics capability that is much of the appeal to using the system. Many of the
analyses have special plotting capabilities that allow you to plot results without storing multiple
intermediate products.
To get a quick feel for how easy it is to create plots, let’s first create a simple data set containing
three numeric variables:
y<-rnorm(50,0,1) # creates a vector of random numbers; length 50; mean 0; standard deviation 1
Now we can produce a simple scatter plot of y against x using the basic plot() function. Simply type:
plot(x,y)
Note that a call to any of the plotting functions will automatically open a graphics device and
display the results in that device.
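The definition of x does not appear in this section, so the following runnable sketch assumes a uniform x on (0, 10) purely for illustration; the seed and the axis labels are also assumptions.

```r
set.seed(42)
x <- runif(50, 0, 10)   # assumed predictor; x is not defined in the notes
y <- rnorm(50, 0, 1)    # as in the notes: 50 draws, mean 0, sd 1
plot(x, y, main = "y against x", xlab = "x", ylab = "y")
```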
We can change just about any aspect of the plot with a bewildering array of graphical controls
given as arguments to the plot function. Here are some examples for you to try:
plot(x,y,type="o") # to change the type of plot, try type="l", "o", "b", and "s"
To see a complete list of plot controls, look at the help file for the par() function:
help(par)
Most or all of the par commands to control the graphics can be given as arguments to the plot
function (as above). However, it is also possible to set these graphics controls for the graphics
device being used, so that all plots to that device will adopt these same controls. This is done by
issuing a par() command before a plot command. Some examples are as follows:
par(mar=c(0.6,0.5,0.1,0.5)) # specifies margin size in lines of text (bottom, left, top, right); use mai for inches
Of course there are many more options and these can all be specified in a single command, for
example:
par(mfrow=c(2,3),new=TRUE, mar=c(0.6,0.5,0.1,0.5))
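Since par() changes the settings of the current device for all later plots, it is often convenient to save the previous settings and restore them afterwards. A minimal sketch, with arbitrary layout and margin values:

```r
# par() returns the previous values of the parameters it changes
old_par <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
hist(rnorm(100), main = "Histogram")      # left panel
boxplot(rnorm(100), main = "Box plot")    # right panel
par(old_par)                              # restore the earlier settings
```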
Of course there are many different kinds of plots for displaying data. The basic plot() function is
simply a starting point. There are many different so-called “high-level” plotting functions, for
example:
contour() — draws contours using 3 variables
In addition, there are many so-called “low-level” plotting functions used to plot additional
elements over an existing plot (i.e., overlays). These low-level functions are always called after a
high-level command in order to supplement the high-level plot. Some examples of low-level
commands include:
abline() — draw a line in intercept and slope form across an existing plot
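The division of labour between high-level and low-level functions can be sketched as follows. The data are simulated and the seed is arbitrary; only plot(), lm() and abline() from base R are used.

```r
set.seed(7)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)                        # simulated data
plot(x, y, main = "Scatter plot with fitted line")  # high-level: starts a new plot
fit <- lm(y ~ x)                                    # least-squares fit
abline(fit, col = "red")                            # low-level: overlay the fitted line
abline(h = mean(y), lty = 2)                        # low-level: dashed line at mean(y)
```

Calling abline() before any high-level plot exists gives an error, which is why the order matters.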
There is of course much more detail to plotting in R, but this should suffice for now. We will be
making extensive use of the plotting capabilities of R throughout this course.
We use the famous (Fisher's) iris data set to illustrate the relationships between two or
more variables in a two-dimensional plane.
However, we have more than two variables of interest. A set of pairwise scatterplots (sometimes
called a draftsman plot) may be of use:
> pairs(iris)
> pairs(iris[1:4])
> pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21, bg = c("red", "green3",
"blue")[unclass(iris$Species)])
There are other useful functions available. For example, what does splom() do? (Look up ?splom.)
> library(lattice)
> ?splom
There are facilities in R for making 3D-effect scatterplots: you need to download and install an
additional package, and when you load the package you may need to tell R where to find it. It is
just possible to envisage the three dimensions on the printed page.
> install.packages("scatterplot3d")
> library(scatterplot3d)
> attach(iris)
> library(rgl)
> detach(iris)
> contour(x, y, volcano, levels = seq(90, 200, by = 5), add = TRUE, col = "peru")
> box()
> persp(x, y, volcano, theta = 30, phi = 30, expand = 0.5, col = "lightblue", xlab = "X", ylab =
"Y", zlab = "Altitude ")
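The contour() and persp() calls above need x and y coordinates for the built-in volcano matrix, and contour(..., add = TRUE) needs an existing plot to draw on. The sketch below assumes a 10 m grid spacing and uses image() as the underlying plot; both are assumptions, not taken from the notes.

```r
x <- 10 * (1:nrow(volcano))   # assumed grid spacing of 10 m along the rows
y <- 10 * (1:ncol(volcano))   # and along the columns
image(x, y, volcano, col = terrain.colors(100), xlab = "X", ylab = "Y")
contour(x, y, volcano, levels = seq(90, 200, by = 5), add = TRUE, col = "peru")
box()
persp(x, y, volcano, theta = 30, phi = 30, expand = 0.5, col = "lightblue",
      xlab = "X", ylab = "Y", zlab = "Altitude")
```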
Lab Exercise
x1 9 2 6 5 8
x2 12 8 6 4 10
x3 3 4 0 2 1
a. Create 3 vectors x1, x2, and x3.
b. Construct a matrix
X =
9 12 3
2 8 4
6 6 0
5 4 2
8 10 1
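One possible way to approach parts a and b is sketched here. The values of x1 are read off the first column of the matrix X, and cbind() is just one of several ways to build the matrix.

```r
x1 <- c(9, 2, 6, 5, 8)     # first column of X
x2 <- c(12, 8, 6, 4, 10)
x3 <- c(3, 4, 0, 2, 1)
X <- cbind(x1, x2, x3)     # bind the three vectors as columns
print(X)
```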
eda <- function(x) {
  par(mfrow = c(2, 3))          # 2 x 3 grid of plots
  qqnorm(x)
  qqline(x)
  boxplot(x)
  title("Box Plot")
  hist(x, main = "Histogram")
  iqd <- summary(x)[5] - summary(x)[2]   # interquartile distance
  plot(density(x, width = 2*iqd), xlab = "x", ylab = "", type = "l")
  ts.plot(x)
  acf(x)
  invisible()
}
eda(trackrecords$marathon_min)
f<-function(x)
{return(1/(1+x^2))}
h<-0.1
a<-0
b<-1
x<-seq(a,b, by=h)
y<-f(x)
n<-round((b-a)/h) # number of subintervals
coef=c(rep(1,1),rep(2,n-1),rep(1,1)) # trapezoidal weights 1, 2, ..., 2, 1
I=(h/2)*(sum(coef*y))
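f<-function(x)
{return(1/(1+x^2))}
n=100
if(n%%2!=0){
print("n must be even") # message assumed; the original branch is empty
}else{
a<-0
b<-1
h<-(b-a)/n
x<-seq(a,b, by=h)
y<-f(x)
coef=c(rep(1,1),rep(c(4,2),(n-2)/2),rep(c(4,1),1)) # Simpson weights 1, 4, 2, ..., 4, 1
I=(h/3)*(sum(coef*y))
}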
The following proof shows that the Box-Muller transformation produces independent normal
random variables.
$$\Rightarrow f_Z(z) = f_{Z_1}(z_1)\, f_{Z_2}(z_2)$$
This shows that the joint density of Z = (Z1, Z2) is the product of two standard normal
densities, so Z1 and Z2 are independent standard normal variables.
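The transformation itself is not written out in this section. The sketch below assumes its usual form, Z1 = sqrt(-2 ln U1) cos(2 pi U2) and Z2 = sqrt(-2 ln U1) sin(2 pi U2), and checks the claim empirically. The name box_muller is hypothetical.

```r
# Box-Muller: two independent Uniform(0,1) draws give two independent N(0,1) draws
box_muller <- function(n) {     # hypothetical helper name
  u1 <- runif(n)
  u2 <- runif(n)
  r <- sqrt(-2 * log(u1))       # radius
  theta <- 2 * pi * u2          # angle
  cbind(z1 = r * cos(theta), z2 = r * sin(theta))
}
set.seed(123)
z <- box_muller(10000)
colMeans(z)           # both should be close to 0
apply(z, 2, var)      # both should be close to 1
cor(z[, 1], z[, 2])   # should be close to 0
```

A near-zero sample correlation does not by itself prove independence; it is only a quick sanity check of the result.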
Application of the accept-reject method to normal random number generation
Suppose we want to generate X ~ N(µ, σ²). We can write X = σZ + µ, where Z denotes a rv with
the N(0,1) distribution. Thus it suffices to find an algorithm for generating Z ~ N(0,1).
Moreover, if we can generate from the absolute value |Z|, then by symmetry we can obtain our Z by
independently generating a rv S (for sign) that is ±1 with probability 0.5 each and setting
Z = S*|Z|. In other words, we generate U ~ U(0,1) and set Z = -|Z| if U ≤ 0.5 and Z = |Z| if
U > 0.5. The density of |Z| is
$$f(x) = \frac{2}{\sqrt{2\pi}}\, e^{-x^2/2}, \quad x \ge 0$$
For the instrumental density we take $g(x) = e^{-x}$, $x > 0$, the exponential density with rate 1,
something we already know how to easily simulate using the inverse transform method.
Now
$$h(x) = \frac{f(x)}{g(x)} = \sqrt{2/\pi}\; e^{x - x^2/2}$$
If we can find a maximum value M for h(x), such that $\frac{f(x)}{g(x)} \le M$, we can say that
$f(x) \le M g(x)$. Therefore, we simply use calculus to compute its maximum (solve $h'(x) = 0$),
which must occur at the value of x that maximizes the exponent $x - x^2/2$, namely at x = 1.
Therefore $M = h(1) = \sqrt{2e/\pi}$.
Further,
$$\frac{f(y)}{M g(y)} = \frac{\sqrt{2/\pi}\; e^{y - y^2/2}}{\sqrt{2e/\pi}} = e^{y - 1/2 - y^2/2} = e^{-(y-1)^2/2}$$
so that $U \le \frac{f(y)}{M g(y)}$ means that $U \le e^{-(y-1)^2/2}$.
1. Generate Y with an exponential distribution at rate 1; that is, generate U and set Y = -ln(U)
2. Generate another U
3. If $U \le e^{-(Y-1)^2/2}$, set |Z| = Y; otherwise go back to 1
4. Generate another U. Set Z = -|Z| if U ≤ 0.5 and set Z = |Z| if U > 0.5
Note: $U \le e^{-(y-1)^2/2}$ if and only if $-\log(U) \ge (y-1)^2/2$, and since -log(U) is
exponential with rate 1, we can simplify the above algorithm as
1. Generate Y1 and Y2 with exponential distribution at rate 1; that is, generate U1 and U2 and set
Y1 = -ln(U1), Y2 = -ln(U2)
2. If $Y_2 \ge (Y_1 - 1)^2/2$, set |Z| = Y1; otherwise go back to 1
3. Generate another U. Set Z = -|Z| if U ≤ 0.5 and set Z = |Z| if U > 0.5
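The simplified three-step algorithm can be sketched in R as follows. The function name rnorm_ar, its arguments and the seed are hypothetical; the loop draws one value at a time for clarity rather than speed.

```r
# Hypothetical helper: rnorm_ar() is an illustrative name, not from the notes
rnorm_ar <- function(n, mu = 0, sigma = 1) {
  z <- numeric(n)
  for (k in seq_len(n)) {
    repeat {
      y1 <- -log(runif(1))               # step 1: Y1 ~ Exp(1)
      y2 <- -log(runif(1))               #         Y2 ~ Exp(1)
      if (y2 >= (y1 - 1)^2 / 2) break    # step 2: accept |Z| = Y1
    }
    u3 <- runif(1)                       # step 3: random sign
    z[k] <- if (u3 <= 0.5) -y1 else y1
  }
  sigma * z + mu                         # X = sigma * Z + mu
}
set.seed(2024)
x <- rnorm_ar(5000)
mean(x)   # should be close to 0
sd(x)     # should be close to 1
```

A qqnorm(x) plot is another quick way to judge whether the output looks normal.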
#############################################################################
if(u3<0.5)
else
else
#############################################################################
g <-function(x)
x^3-2*x-5
derg <- function(x)
3*x^2-2
newton2 <- function(fun, derf, x0, eps, nlim){
iter <- 0
repeat{
iter <- iter+1
if(iter > nlim){
cat(" Iteration Limit Exceeded: Current = ",iter,fill = T)
x1 <- NA
break
}
x1 <- x0 - fun(x0)/derf(x0)
if(abs(x0 - x1) < eps||abs(fun(x1))<1.0e-12)
break
x0 <- x1
cat("******Iter. No: ", iter, " Current Iterate =", x1,fill=T)
}
return(x1)
}
newton2(g,derg,2.0,.00001,100)
################################
iteration <- function(f,x0,tol=0.0000001){
x <- x0
while(abs(f(x)-x)>tol) x <- f(x)
return(x)}
NR <- function(f,f1,x0,tol=0.000001){
x <- x0
delta <- f(x)/f1(x)
while(abs(delta)>tol){
x <- x-delta
delta <- f(x)/f1(x)}
return(x)}
##Problem 1
# Find the real root of the equation x^6 - x^4 - x^3 - 1 = 0
# between 1.4 and 1.5
f2 <- function(x) x^6 - x^4 - x^3 - 1
f21 <- function(x) 6*x^5 - 4*x^3 - 3*x^2 # derivative of f2
NR(f=f2,f1=f21,x0=1.45)
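As a cross-check on Problem 1, base R's uniroot() can search the same interval. This is a sketch only; the tolerance is an arbitrary choice and the function f2 is written out from the equation in the problem statement.

```r
f2 <- function(x) x^6 - x^4 - x^3 - 1           # equation from Problem 1
root <- uniroot(f2, interval = c(1.4, 1.5), tol = 1e-9)
root$root       # approximate root between 1.4 and 1.5
f2(root$root)   # should be close to 0
```

uniroot() requires f2 to have opposite signs at the two ends of the interval, which is the case here.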
trapezoid <- function(f, a, b, n) {
  h <- (b-a)/n
  x <- seq(a, b, by = h)
  y <- sapply(x, f)
  T <- h*(y[1]/2 + sum(y[2:n]) + y[n+1]/2)
  return(T)
}
f1 <- function(x)
  return(4 * x^3)
trapezoid(f1, 0, 1, n = 200)
######################################################################
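simpson <- function(f, a, b, n) {          # n must be even and at least 4
  h <- (b-a)/n
  x1 <- seq(a + h, b - h, by = 2*h)        # odd-numbered interior nodes (reconstructed)
  x2 <- seq(a + 2*h, b - 2*h, by = 2*h)    # even-numbered interior nodes (reconstructed)
  f1 <- sapply(x1, f)
  f2 <- sapply(x2, f)
  S <- h/3*(f(a) + 4*sum(f1) + 2*sum(f2) + f(b))   # reconstructed Simpson 1/3 sum
  return(S)
}
simpson(f1, 0, 1, 20)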
##########################################
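trapezoid <- function(fun, a, b, n=100) {
  h <- (b-a)/n
  x <- seq(a, b, by = h)      # nodes (reconstructed line)
  y <- fun(x)
  s <- h*(y[1]/2 + sum(y[2:n]) + y[n+1]/2)   # trapezoidal sum (reconstructed line)
  return(s)
}
f1 <- function(x)
  return(4 * x^3)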
###################################
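simpson <- function(f, a, b, n) {
  # n must be an even positive integer
  h <- (b-a)/n
  x <- seq(a, b, by = h)      # nodes (reconstructed line)
  if (n == 2) {
    s <- f(x[1]) + 4*f(x[2]) + f(x[3])   # single panel (reconstructed line)
  } else {
    s <- f(x[1]) + f(x[n+1]) + 2*sum(f(x[seq(3,n-1,by=2)])) + 4*sum(f(x[seq(2,n, by=2)]))
  }
  s <- s*h/3
  return(s)
}
f1 <- function(x)
  return(4 * x^3)
simpson(f=f1, 0, 1, n = 100) # n changed from 99 to 100: Simpson's 1/3 rule needs an even n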
####################################
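Simpson 3/8
simpson3_8 <- function(fun, a, b, n) {
  # n must be a multiple of 3
  h <- (b-a)/n
  x <- seq(a, b, by = h)      # nodes (reconstructed line)
  if (n == 3) {
    s <- fun(x[1]) + 3*fun(x[2]) + 3*fun(x[3]) + fun(x[4])   # reconstructed line
  } else {
    s <- fun(x[1]) + fun(x[n+1]) +
      3*sum(fun(x[seq(2,n-1, by=3)])) + 3*sum(fun(x[seq(3,n, by=3)])) +   # reconstructed terms
      2*sum(fun(x[seq(4,n, by=3)]))
  }
  s <- s*3*h/8
  return(s)
}
f1 <- function(x)
  return(4 * x^3)
simpson3_8(fun=f1, 0, 1, n = 99)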
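The test integrand 4x^3 integrates to exactly 1 on [0, 1], so the results of the rules above can be judged against that value. As an additional check, base R's integrate() can be used; this is a sketch and not part of the original programs.

```r
f1 <- function(x) 4 * x^3
res <- integrate(f1, lower = 0, upper = 1)   # adaptive quadrature from base R
res$value        # close to the exact value 1
res$abs.error    # estimated absolute error
```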
Ab<-matrix(c(2,4,6,2,5,9,6,3,1,0,4,8),nrow=3)
n<-nrow(Ab)
Ab[1,]<-Ab[1,]/Ab[1,1]
Ab[2,]<-Ab[2,]-Ab[2,1]*Ab[1,]
Ab[3,]<-Ab[3,]-Ab[3,1]*Ab[1,]
Ab[2,]<-Ab[2,]/Ab[2,2]
Ab[3,]<-Ab[3,]-Ab[3,2]*Ab[2,]
Program
#####################################
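Ab<-matrix(c(2,4,6,2,5,9,6,3,1,0,4,8),nrow=3)
m<-nrow(Ab)
n<-ncol(Ab)
# forward elimination: scale each pivot row and clear the entries below the pivot
for(j in 1:(m-1)){
  Ab[j,]<-Ab[j,]/Ab[j,j]
  for (i in (j+1):m)
    Ab[i,]<-Ab[i,]-Ab[i,j]*Ab[j,]
}
Ab[m,]<-Ab[m,]/Ab[m,m]
# backward elimination: clear the entries above each pivot
for(j in 2:m){
  for(i in 1:(j-1))
    Ab[i,]<-Ab[i,]-Ab[j,]*Ab[i,j]}
# the solution is the last column of the reduced matrix
x<-rep(0,m)
for (i in 1:m)
  x[i]<-Ab[i,n]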
###############################################
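The result of the elimination program can be checked against base R's solve(). In this sketch, A and b are simply the first three columns and the last column of the augmented matrix Ab used above; the name x_check is made up.

```r
A <- matrix(c(2, 4, 6, 2, 5, 9, 6, 3, 1), nrow = 3)  # first three columns of Ab
b <- c(0, 4, 8)                                      # last column of Ab
x_check <- solve(A, b)                               # solves A x = b
print(x_check)                                       # 0.8 0.4 -0.4
```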
MLE of Cauchy
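n=100
u<-runif(n)
x<-tan(pi*(u-(1/2))) # inverse-transform sample from the standard Cauchy distribution
logc<-function(theta,x) sum(-log(pi)-log(1+(x-theta)^2))
theta<-seq(-10,10,by=0.1) # grid of theta values (assumed; its definition is missing here)
plot(theta,sapply(theta,logc,x),type="l",ylab="Log-likelihood of Cauchy")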
###
g <-function(theta)
  sum(2*(x-theta)/(1+(x-theta)^2)) # score: derivative of the log-likelihood
derg <- function(theta)
  sum((2*(x-theta)^2-2)/(1+(x-theta)^2)^2) # derivative of g
# newton2() is the Newton-Raphson function defined earlier
newton2(g,derg,2.0,.00001,100)
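gg <- function(theta,x) sum(2*(x-theta)/(1+(x-theta)^2))
dgg <- function(theta,x) sum((2*(x-theta)^2-2)/(1+(x-theta)^2)^2) # derivative of gg (name assumed)
x<-rcauchy(100)
#alternatively
optimize(function(theta) logc(theta,x),interval=c(0,10),maximum=TRUE)
optimize(logc,interval=c(0,10),maximum=TRUE,x=x)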
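A self-contained version of the optimize() approach might look as follows. The seed and the search interval c(-10, 10) are assumptions made for this sketch; since the sample is drawn with true location 0, a symmetric interval seems a natural choice.

```r
set.seed(10)
x <- rcauchy(100)                                   # sample with true location 0
logc <- function(theta, x) sum(-log(pi) - log(1 + (x - theta)^2))
fit <- optimize(logc, interval = c(-10, 10), maximum = TRUE, x = x)
fit$maximum     # approximate MLE of the location
fit$objective   # log-likelihood at that value
median(x)       # the sample median serves as a rough check
```

optimize() looks for a local maximum only, so plotting the log-likelihood first remains a useful safeguard.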
#Scaling
repeat{
i <- i+1
}
return(sum)
f<-function(x)
{return(1/(1+x^2))}
h<-0.1
a<-0
b<-1
x<-seq(a,b, by=h)
y<-f(x)
n<-(b-a)/h
coef=c(rep(1,1),rep(2,n-1),rep(1,1))
I=(h/2)*(sum(coef*y))
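f<-function(x)
{return(1/(1+x^2))}
n=100
if(n%%2!=0){
print("n must be even") # message assumed; the original branch is empty
}else{
a<-0
b<-1
h<-(b-a)/n
x<-seq(a,b, by=h)
y<-f(x)
coef=c(rep(1,1),rep(c(4,2),(n-2)/2),rep(c(4,1),1)) # Simpson weights 1, 4, 2, ..., 4, 1
I=(h/3)*(sum(coef*y))
}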