Discussion 3 Supervised
Discussion 3 Supervised
Discussion 3 Supervised
Steve Horvath
E-mail: shorvath@mednet.ucla.edu
http://www.ph.ucla.edu/biostat/people/horvath.htm
All analyses were done within the R statistical analyses software. R links: http://www.r-
project.org/ for general information, and http://cran.r-project.org/ for downloading.
5) RF (Random forest)
R package: randomForest
function: randomForest
The underlying Fortran code is from Leo Breiman
6) Error estimation:
cv-10 (10-fold cross-validation); .632+
Package: ipred, which requires packages mlbench, survival, nnet, mvtnorm.
mvtnorm.ipred which provides very convenient wrappers to various statistical methods.
1
Download the relevant libraries as follows:
i) click button “packages” on the R session bar
ii) choose “Install packages from cran..” Hint: the computer needs has to be
connected to the internet.
iii) To find out the contents of a library, type help(package="ipred")
iv) read the libraries into the R session by using the library() command, see
below.
R SESSION
library(MASS)
library(class)
library(rpart) # recursive partitioning, tree predictors....
library(tree)
library(e1071)
library(randomForest)
library(mlbench);library(survival); library(nnet); library(mvtnorm)
library(ipred)
# the followin function takes a table and computes the error rate.
# it assumes that the rows are predicted class outcomes while the #columns are observed
#(test set) outcomes
rm(misclassification.rate)
misclassification.rate=function(tab){
num1=sum(diag(tab))
denom1=sum(tab)
signif(1-num1/denom1,3)
}
2
# RPART (tree analysis)
rp1=rpart(factor(y)~x1+x2+x3+x4+x5,data=dat1)
plot(rp1)
text(rp1)
x1< 1.421
|
1 2
summary(rp1)
Call:
rpart(formula = factor(y) ~ x1 + x2 + x3 + x4 + x5, data = dat1)
n= 50
CP nsplit rel error xerror xstd
1 0.64 0 1.00 1.36 0.1319394
2 0.01 1 0.36 0.40 0.1131371
3
# Let us now eliminate the signal variable!!!
# further we choose 3 fold cross-validation and a cost complexity parameter=0
rp1=rpart(factor(y)~x2+x3+x4+x5,control=rpart.control(xval=4, cp=0), data=dat1)
plot(rp1)
text(rp1)
x4< 2.641
|
x3< 1.883
2
1 2
Note that the above tree overfits the data since x4 and x5 have nothing to do with y!
From the following output you can see that the cross-validated relative error rate is 1.28,
i.e. it is worth than the naive predictor (stump tree), that assigns each observation the
class 1.
summary(rp1)
summary(rp1)
Call:
rpart(formula = factor(y) ~ x2 + x3 + x4 + x5, data = dat1, control =
rpart.control(xval = 4,
cp = 0))
n= 50
CP nsplit rel error xerror xstd
1 0.20 0 1.00 1.12 0.1403994
2 0.12 1 0.80 1.24 0.1372880
3 0.00 2 0.68 1.28 0.1357645
ETC
4
# let us cross-tabulate learning set predictions versus true learning set outcomes:
tab1=table(predict(rp1,newdata=dat1,type="class"),dat1$y)
tab1
1 2
1 18 10
2 7 15
misclassification.rate(tab1)
[1] 0.34
# Note the error rate is unrealistically low, given that the predictors have nothing to do
# with the outcome. This illustrates that the “resubstitution” error rate is biased.
# Now let’s cross-tabulate the test set predictions with the test set outcomes:
tab1=table(predict(rp1,newdata=dattest,type="class"),dattest$y)
tab1
> tab1
1 2
1 34 26
2 20 20
misclassification.rate(tab1)
[1] 0.46
# this test set error rate is realistic given that the predictor contained no information.
5
#Linear Discriminant Analysis
dathelp=data.frame(x1,x2,x3,x4,x5)
Group means:
x1 x2 x3 x4 x5
1 0.9733358 1.474684 1.450246 1.405641 1.491884
2 2.0817099 1.580361 1.604800 1.649404 1.563162
# resubstitution error
tab1=table(predict(lda1)$class,y)
tab1
misclassification.rate(tab1)
> tab1
y
1 2
1 19 6
2 6 19
> misclassification.rate(tab1)
[1] 0.24
6
data(iris)
### parameter values setup
cv.k = 10 ## 10-fold cross-validation
B = 100 ## using 100 Bootstrap samples in .632+ error estimation
C.svm = 10 ## Cost parameters for svm, needs to be tuned for different datasets
7
#k-nearest neighbor predictors#
#Currently, there is an error in the underlying wrapper code for "knn" in package ipred.
#The error is due to the name conflict of variable "k" used in the wrapper function
#"ipredknn" and the original function "knn".
# We need to change variable "k" to something else (here "kk") to avoid conflict.
## .632+
errorest(Species ~ ., data=iris, model=ipredknn, estimator="632plus",
est.para=control.errorest(nboot=B), predict=bwpredict.knn, kk=1)$err
[1] 0.04141241
# Note that the k=3 nearest neighbor predictor leads to lower error rates
# than the k=1 NN predictor.
8
# Use function "bagging" in package "ipred" which calls "rpart" for classification.
## The error returned is out-of-bag estimation.
bag1=bagging(Species ~ ., data=iris, nbagg=B, control=rpart.control(minsplit=2, cp=0,
xval=0), comb=NULL, coob=TRUE, ns=dim(iris)[1], keepX=TRUE)
> bag1
Bagging classification trees with 100 bootstrap replications
# The following tables lists the out-of bag estimates versus observed species
table(predict(bag1),iris$Species)
setosa versicolor virginica
setosa 50 0 0
versicolor 0 46 5
virginica 0 4 45
## .632+
errorest(Species ~ ., data=iris, model=svm, estimator="632plus",
est.para=control.errorest(nboot = B), cost=C.svm)$error
[1] 0.03428103
9
Chapter 3: How to filter genes and use filtered genes in kNN
predictors
Let’s first install bioconductor as follows.
1) Copy and paste the following functions into your R session.
getBioC <- function (libName = "default", relLevel = "release",
destdir, versForce=TRUE,
verbose = TRUE, bundle = TRUE,
force=TRUE, getAllDeps=TRUE, method="auto") {
## !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
## !!! Always change version number when updating this file
getBioCVersion <- "1.2.52"
writeLines(paste("Running getBioC version ",getBioCVersion,"....\n",
"If you encounter problems, first make sure that\n",
"you are running the latest version of getBioC()\n",
"which can be found at: www.bioconductor.org/getBioC.R",
"\n\n",
"Please direct any concerns or questions to",
" bioconductor@stat.math.ethz.ch.\n",sep=""))
## !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
if (compareVersion(rVers,MINIMUM ) < 0)
{
stop(paste("\nYou are currently running R version ",rVers,
", however R version ",MINIMUM,
" is required.",sep=""))
}
if (length(destdir) > 1)
stop("Invalid destdir parameter, must be of length 1")
if (file.access(destdir,mode=2) < 0)
stop(paste("You do not have write access to",destdir,
"\nPlease check your permissions or provide",
"a different 'destdir' parameter"))
if (relLevel == "devel")
optReps <- curOps[c("BIOCDevel","BIOCData",
"BIOCCourses","BIOCcdf",
"BIOCprobes","CRAN",
"BIOCOmegahat")]
else
optReps <- curOps[c("BIOCRel1.3","BIOCData",
"BIOCCourses","BIOCcdf","BIOCprobes",
"CRAN", "BIOCOmegahat")]
options(repositories2=optReps)
syncLocalLibList(destdir, quiet=TRUE)
if (length(installPkgs) > 0)
statusList(out) <- install.packages2(installPkgs, bioCEntries, lib=destdir,
type = ifelse(PLATFORM == "unix", "Source",
"Win32"), versForce=versForce,
recurse=FALSE,
getAllDeps=getAllDeps,
method=method,
force=force, searchOptions=TRUE)
}
if (length(updated(out)) == 0)
print("All requested packages are up to date")
else
print(out)
## If they are using 'default', alert the user that they have not
## gotten all packages
if (libName == "default") {
out <- paste("You have downloaded a default set of packages.\n",
"If you wish to see other download options, please",
" go to the URL:\n",
"http://www.bioconductor.org/faq.html#getBioC\n",
sep="")
cat(out)
}
if(inherits(tryMe, "try-error"))
stop(paste("The url:",URL,
"does not seem to have a valid PACKAGES file."))
close(con)
return(tryMe)
}
## Returns the url for some files that are needed to perform the
## functions. name is added to teh end of the URL
getReposURL <- function(relLevel, name="", bioCoption){
URL <- switch(relLevel,
"devel"= paste(bioCoption,
"repository/devel/package",
name, sep ="/"),
"releas e"=paste(bioCoption,
"repository/releas e1.3",
"/package",name,sep="/"),
character())
URL
}
## Returns the file name with the destination path (if any) attached
getFileName <- function(url, destdir){
temp <- unlist(strsplit(url, "/"))
if(is.null(destdir))
return(temp[length(t emp)])
else
return(file.path(destdir, temp[length(temp)]))
}
return(TRUE)
}
## From reposTools
GBCuserQuery <- function(msg, allowed=c("yes","y","no","n")) {
## Prompts the user with a string and for an answer
## repeats until it gets allowable input
repeat {
allowMsg <- paste("[",paste(allowed,collapse="/"),
"] ", sep="")
outMsg <- paste(msg,allowMsg)
cat(outMsg)
ans <- readLines(n=1)
if (ans %in% allowed)
break
else
cat(paste(ans,"is not a valid response, try again.\n"))
}
ans
}
10
2) Now activate the installation by typing
getBioC()
# Now we will use the filter functions that are described in vignette 1,
# Vignettes are pdf files that can be accessed by typing
openVignette("genefilter")
TASK1: Let’s select genes that are expressed above 500 in at least 10 samples.
# create a filter function
f1=kOverA(10,500)
# assemble the filter functions into a filtering function
ffun=filterfun(f1)
# apply the filtering function to the expression matrix
which=genefilter(dat1,ffun)
table(which)
> table(which)
which
FALSE TRUE
880 120
11
# Now we define the Anova filter.
# which filters out genes that are significantly different across the 4 tissues (p<0.01)
Afilter=Anova(tissue1,0.01)
aff=filterfun(Afilter)
which2=genefilter(dat1,aff)
table(which2)
# TASK 3: Let us now filter out genes that are significantly different across
# the tissues AND are expressed above 100 in at least 10 samples
Afilter=Anova(tissue1,0.01)
f1=kOverA(10,100)
aff=filterfun(Afilter,f1)
which2=genefilter(dat1,aff)
table(which2)
which2
FALSE TRUE
483 517
library(class)
# Here we use a new definition of a cross validation function for k-nearest neighbor
# compare it to knn.cv in the class library
rm(knnCV)
knnCV = function(EXPR, selectfun, cov, Agg, pselect = 0.01,Scale = FALSE) {
nc <- ncol(EXPR)
outvals <- rep(NA, nc)
for (i in 1:nc) {
v1 <- EXPR[, i]
expr <- EXPR[, -i]
glist <- selectfun(expr, cov[-i], p = pselect)
expr <- expr[glist, ]
if (Scale) {
expr <- scale(expr)
v1 <- as.vector(scale(v1[glist]))
}
else v1 <- v1[glist]
out <- paste("iter ", i, " num genes= ", sum(glist),sep ="")
print(out)
Aggregate(row.names(expr), Agg)
if (length(v1) == 1)
# the red number selects k=5 nearest neighbors
outvals[i] <- knn(expr, v1, cov[-i], k = 5)
else outvals[i] <- knn(t(expr), v1, cov[-i], k = 5)
}
return(outvals)
}
12
rm(gfun)
gfun <- function(expr, cov, p = 0.05,k1=5,level1=100) {
f2 <- Anova(cov, p = p)
f3= kOverA(k1,level1)
ffun <- filterfun(f2,f3)
which <- genefilter(expr, ffun)
}
misclassification.rate(tab1)
[1] 0.118
genes.used=multiget(ls(env=aggenv(Agg)),env=aggenv(Agg))
genes.counts=as.numeric(genes.used)
names(genes.counts)=names(genes.used)
sort(genes.counts)
.....
51 51 51
AFFX.HSAC07/X00351.5.at AFFX.HSAC07/X00351.M.at
AFFX.HUMGAPDH/M33197.5.at
51 51
51
AFFX.HUMGAPDH/M33197.M.at
51
The variable genes.counts contains for each gene the number of times it was
selected in the cross validation.
Homework problems 2
13
Microarray Data and Supervised Learning
Biostats 278, Steve Horvath
To understand this homework, read the corresponding discussion notes carefully.
Use the data set MicroarrayExample.csv for the following analyses.
14