Nazarov QC-Statistics
Petr Nazarov

MISB Course Transcriptomics (Prof. Dr. Stephanie Kreis) petr.nazarov@lih.lu

2023-10-24 http://edu.modas.lu/transcript-seq
Outline of the Course

1. Data overview
   RNA-seq data generation
   File formats, Phred quality
   Sequence-based QC: FastQC / MultiQC
   Statistical properties of the data

2. Exploratory data analysis
   Distributions & boxplots
   Dimensionality reduction: PCA, MDS, tSNE, UMAP
   Clustering
   Heatmaps for expression and correlation
   Detection of outliers

3. Statistical basics
   Hypothesis testing (p-value)
   T-test, Wilcoxon test
   Multiple testing (FDR, FWER)
   Linear models: ANOVA

4. Statistics for RNA-seq
   Differential expression analysis
   EdgeR, DESeq, limma
   Enrichment analysis

Please see scripts and materials online http://edu.modas.lu/transcript-seq

Part 1

1. Data Overview
http://edu.modas.lu/transcript-seq/part1.html

1.1. RNA-seq Data Generation
read: short fragment detected by RNA-seq
library: collection of all reads from the sample
CPM: counts per million reads (counts scaled by library size)
TPM: transcripts per million (proportion)
FPKM: fragments per kilobase of exon per million reads mapped
RPKM: reads per ……. (for single-end)

Xi – observed number of reads for gene i
N – library size
li – length of the gene (transcript)

raw counts → normalized counts (CPM, FPKM, RPKM)
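The normalizations listed above can be sketched in a few lines of R. This is only an illustration: the counts are three genes taken from the example table of this course, the gene lengths are hypothetical, and the "library" consists of just these three genes.

```r
## raw counts: three genes, two samples (A1, B1)
counts = matrix(c(32, 0, 552,  136, 395, 428), ncol = 2,
                dimnames = list(c("SP110", "GBP5", "EIF4B"), c("A1", "B1")))
## ASSUMPTION: hypothetical gene lengths in base pairs (not course data)
len = c(SP110 = 2000, GBP5 = 4000, EIF4B = 3000)

N    = colSums(counts)                    # library size N per sample
cpm  = t(t(counts) / N) * 1e6             # CPM:  Xi / N * 1e6
fpkm = cpm / (len / 1000)                 # FPKM: Xi / (N * li) * 1e9
rate = counts / len                       # reads per base of each gene
tpm  = t(t(rate) / colSums(rate)) * 1e6   # TPM: rates rescaled to sum to 1e6

colSums(cpm)   # each sample sums to 1e6
colSums(tpm)   # each sample sums to 1e6
```

TPM values always sum to one million per sample, which is why TPM is described as a proportion; FPKM values do not have this property.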

10-minute simple explanation of TPM / FPKM:
https://www.youtube.com/watch?v=TTUrtCY2k-w

Wang Z et al. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009
1.2. File Formats
Pipeline: Raw image files (e.g. BCL) → FASTQ files → Mapping, alignment → SAM/BAM files → Counting → Raw counts → Normalized counts (CPM, TPM, RPKM…)

FASTQ files:
@HWI-ST508:152:D06G9ACXX:2:1101:1160:2042 1:Y:0:ATCACG
NAAGACCGAATTCTCCAAGCTATGGTAAACATTGCACTGGCCTTTCATCTG
+
#11??+2<<<CCB4AC?32@+1@AB1**1?AB<4=4>=BB<9=>?######

SAM/BAM files:
@HD VN:1.0 SO:coordinate
@SQ SN:seq1 LN:5000
@SQ SN:seq2 LN:5000
@CO Example of SAM/BAM file format.
B7_591:4:96:693:509 73 seq1 1 99 36M * 0 0 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7 MF:i:18 Aq:i:73 NM:i:0 UQ:i:0 H0:i:1 H1:i:0
EAS54_65:7:152:368:113 73 seq1 3 99 35M * 0 0 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6): MF:i:18 Aq:i:66 NM:i:0 UQ:i:0 H0:i:1 H1:i:0

Raw counts:
ID              Gene.Symbol  A1   A2   A3   A4   B1   B2
ENSG00000135899 SP110        32   31   33   33   136  136
ENSG00000154451 GBP5         0    0    0    0    395  383
ENSG00000226025 LGALS17A     0    0    0    0    217  196
ENSG00000213512 GBP7         0    0    0    0    44   47
ENSG00000260873 SNTB2        198  193  195  196  483  502
ENSG00000063046 EIF4B        552  546  548  550  428  429
ENSG00000102524 TNFSF13B     0    0    0    0    16   17

Advantage of RNA-seq: you can repeat the pipeline with new knowledge or questions.
1.2. File Formats
@HWI-ST508:152:D06G9ACXX:2:1101:1160:2042 1:Y:0:ATCACG
NAAGACCGAATTCTCCAAGCTATGGTAAACATTGCACTGGCCTTTCATCTG
+
#11??+2<<<CCB4AC?32@+1@AB1**1?AB<4=4>=BB<9=>?######

Quality scores started as numbers (0-40) but have since changed to an ASCII encoding to reduce file size and make working with this format a bit easier; however, they still hold the same information. ASCII codes are assigned based on the formula found below. This table can serve as a lookup as you progress through your analysis.

https://learn.gencore.bio.nyu.edu/ngs-file-formats/quality-scores/
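The formula itself did not survive in this text, so here is a hedged sketch of the usual convention: a Phred score Q relates to the base-call error probability P as Q = -10·log10(P), and the score is stored as the character with ASCII code Q + 33. The +33 offset is an assumption (it is the common Sanger / Illumina 1.8+ encoding); older Illumina files used +64.

```r
## quality line of the FASTQ record shown above
qual = "#11??+2<<<CCB4AC?32@+1@AB1**1?AB<4=4>=BB<9=>?######"

Q = utf8ToInt(qual) - 33   # ASCII code minus offset (ASSUMPTION: Phred+33)
P = 10^(-Q / 10)           # probability that the base call is wrong

head(data.frame(char = strsplit(qual, "")[[1]], Q = Q, P = signif(P, 2)))
summary(Q)
```

The read starts and ends with `#`, i.e. Q = 2, which marks very unreliable bases.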
1.3. Sequence-based QC: FastQC
FastQC – a simple but widely-used Java-based tool for quality control of the experiments at the sequence level. It provides
a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which
you should be aware before doing any further analysis.
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

• Import of data from BAM, SAM or FastQ files (any variant)

• Providing a quick overview to tell you in which areas there
may be problems
• Summary graphs and tables to quickly assess your data
• Export of results to an HTML based permanent report
• Offline operation to allow automated generation of
reports without running the interactive application

Examples

More detailed explanation & examples:

https://scienceparkstudygroup.github.io/rna-seq-lesson/03-qc-of-sequencing-results/index.html#31-running-fastqc
1.3. Sequence-based QC: MultiQC
A modular tool to aggregate results from bioinformatics analyses across
many samples into a single report. Python-based
https://multiqc.info/ - see example online.

Introduction: https://www.youtube.com/watch?v=BbScv9TcaMg

1.4. Statistical Properties of the Data
ID Gene.Symbol A1 A2 A3 A4 B1 B2
ENSG00000135899 SP110 32 31 33 33 136 136
ENSG00000154451 GBP5 0 0 0 0 395 383
ENSG00000226025 LGALS17A 0 0 0 0 217 196
ENSG00000213512 GBP7 0 0 0 0 44 47
ENSG00000260873 SNTB2 198 193 195 196 483 502
ENSG00000063046 EIF4B 552 546 548 550 428 429
ENSG00000102524 TNFSF13B 0 0 0 0 16 17
ENSG00000107201 DDX58 79 81 82 77 296 310
ENSG00000010030 ETV7 2 2 2 0 93 85
ENSG00000125347 IRF1 22 24 27 22 234 236
ENSG00000180616 SSTR2 0 0 0 0 19 21
ENSG00000155962 CLIC2 2 2 1 1 71 65
ENSG00000153944 MSI2 55 54 54 54 37 37
ENSG00000197646 PDCD1LG2 0 0 0 0 58 60
ENSG00000108771 DHX58 5 4 4 5 26 25
ENSG00000100336 APOL4 9 8 11 8 130 135
ENSG00000182551 ADI1 88 86 88 89 59 60
ENSG00000128284 APOL3 14 14 14 13 85 94
ENSG00000153989 NUS1 214 216 212 214 167 167
ENSG00000131979 GCH1 57 61 57 56 172 167

Within one row, the values A1-A4 (or B1-B2) are measurements of the same gene in the same condition. Which distribution describes them?

Poisson distribution: 1 parameter to fit (λ) => too simple!
Negative binomial distribution: 2 parameters to fit (p, r) => fits biology better!
Normal distribution: can be used for log(1+k), when k is large, but it is approximate => less power (still usable but may miss interesting cases)
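A small simulation can illustrate why one parameter is "too simple". For Poisson counts the variance equals the mean, while the negative binomial has a second parameter that lets the variance exceed the mean. The mean and dispersion values below are hypothetical, chosen only for illustration.

```r
set.seed(1)
mu = 100                                     # hypothetical mean count
x_pois = rpois(10000, lambda = mu)           # Poisson: variance = mean
x_nb   = rnbinom(10000, mu = mu, size = 5)   # NB: variance = mu + mu^2/size

c(mean = mean(x_pois), var = var(x_pois))
c(mean = mean(x_nb),   var = var(x_nb))
```

With the same mean, the negative binomial sample shows a much larger variance, which is closer to what biological replicates in RNA-seq look like.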
1.5. Data Repositories
GEO: http://www.ncbi.nlm.nih.gov/gds
US-based repository of omics data

ArrayExpress: http://www.ebi.ac.uk/arrayexpress/
EU-based repository of omics data

TCGA: https://tcga-data.nci.nih.gov/tcga/
~11k tumor samples
Analysis via: http://www.cbioportal.org/public-portal/

GTEx: https://www.gtexportal.org/home/
~17k healthy samples
Take Home Messages 1
Note for the exam in January: check the Word document with an overview of everything on MODAS.lu.
nf-core provides pipelines for the analysis of RNA-seq data, metagenomes, ChIP-seq and Nanopore data.
RNA-seq data can be used as raw counts or normalized (TPM, FPKM). See what you need for a
specific algorithm!

For QC of your samples at the sequence level – use FastQC. To combine results - MultiQC

Expression-related data in transcriptomics are strongly right-skewed. Therefore:

For statistics use either precise distribution (negative binomial for RNA-seq)
or work with log-transformed data
Use log-transformed data for exploratory analysis and visualization

Several large repositories of the data exist. Before planning your experiments – make a
search for existing data
Part 2

2. Exploratory Data Analysis
http://edu.modas.lu/transcript-seq/part2.html

see more here: http://edu.modas.lu/modas_eda/

2.1. Distributions
## download and load the data
## (raw counts are noisy data, directly from the machine)
url = "http://edu.modas.lu/data/rda/LUSC60.RData"
download.file(url, destfile="LUSC60.RData", mode = "wb")
load("LUSC60.RData")
str(LUSC)

## log transform the data and put it to X
X = log2(1+LUSC$counts)

## density plot
plot(density(X), col="blue", lwd=2)

## boxplot (an old method to visualize data)
boxplot(X, col="lightblue", las=2)

## simple normalization (do not use ;) )
XN = scale(X)
## is this "normalized" data?
boxplot(XN, col="lightblue", las=2)

## try this: overall density (black) and density of each sample (blue)
plot(density(X), col="black", lwd=2)
for (i in 1:ncol(X))
  lines(density(X[,i]), col="#0000FF33")

[Figure: reminder of the boxplot definition]
2.2. Dimensionality Reduction
Each sample (object) is represented by 20 000 genes (features)… How can we visualize samples in an understandable way?
=> Use dimensionality reduction! It looks at the data from a different "angle", one that shows most of the variability.

PCA - rotation of the coordinate system in multidimensional space in such a way as to capture the main variability in the data. The new axes follow the directions in which the data vary most, and only basic mathematical operations are needed.

ICA - matrix factorization method that identifies statistically independent signals and their weights.

NMF - matrix factorization method that presents data as a matrix product of two non-negative matrices.

LDA - identifies a new coordinate system, maximizing the difference between objects belonging to predefined groups (see Fig.).

MDS - method that tries to preserve distances between objects in the low-dimensional space.

AE - artificial neural network with a "bottle-neck".

t-SNE - an iterative approach, similar to MDS, but considering only close objects. Thus, similar objects must be close in the new (reduced) space, while distant objects do not influence the results.

UMAP - modern method, similar to t-SNE, but more stable and with preservation of some information about distant groups (preserving the topology of the data).

Please check some nice interactive resources online:
• Principal Component Analysis Explained Visually by Victor Powell
• Understanding UMAP by Andy Coenen and Adam Pearce
• Dimensionality Reduction for Data Visualization: PCA vs TSNE vs UMAP vs LDA by Sivakar Sivarajah
2.2. Dimensionality Reduction: PCA
Principal component analysis (PCA) is a vector space transform used to reduce multidimensional data sets to lower dimensions for analysis (here: 20000 genes => 2 dimensions). It selects the coordinates along which the variation of the data is largest.

For simplicity, let us consider a 2-parameter situation, both in terms of the data and of the resulting PCA.

[Figure, left: scatter plot in "natural" coordinates (Variable 1 vs Variable 2). Right: scatter plot in PC coordinates (First component vs Second component).]

Instead of using 2 "natural" parameters for the classification, we can use the first component!
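The two-variable situation described above can be reproduced with simulated data. This is a minimal sketch: the two correlated variables are hypothetical and stand in for the "natural" coordinates.

```r
set.seed(1)
## ASSUMPTION: two hypothetical, strongly correlated variables
v1 = rnorm(200)
v2 = 0.8 * v1 + rnorm(200, sd = 0.3)
D  = cbind(variable1 = v1, variable2 = v2)

PC = prcomp(D)                          # rotation of the coordinate system
explained = PC$sdev^2 / sum(PC$sdev^2)  # fraction of variance per component
round(explained, 2)

par(mfrow = c(1, 2))
plot(D, pch = 19, main = "natural coordinates")
plot(PC$x, pch = 19, asp = 1, main = "PC coordinates")
```

Almost all variability ends up in the first component, so PC1 alone describes these data nearly as well as the two original variables together.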
2.2. Dimensionality Reduction: PCA example 1

[Figure: PCA plots showing, respectively, a technical effect (batch effect), a biological effect and a time effect]
2.2. Dimensionality Reduction: PCA example 2
Note: PCA captures the differences between samples rather than what they have in common.

## download and load the data for TCGA
## lung squamous cell carcinoma patients
url = "http://edu.modas.lu/data/rda/LUSC60.RData"
download.file(url, destfile="LUSC60.RData", mode = "wb")
load("LUSC60.RData")
str(LUSC)
## log transform the data and put it to X
X = log2(1+LUSC$counts)

##----------------------------------------------------
## exclude genes with 0 variance
X = X[apply(X,1,var)>0,]
## Run PCA on the transposed X
PC = prcomp(t(X),scale=TRUE)
str(PC)
## Visualize. NT: non-tumor (blue), TP: tumor primary (red)
## as.character() makes sure colors are picked by name, not by factor code
plot(PC$x[,1],PC$x[,2], pch=19,
     col=c(NT="blue",TP="red")[as.character(LUSC$meta$sample_type)])

##---= or use my wrapper =---
source("http://r.modas.lu/plotPCA.r")
plotPCA(X, cex=1.5,
        col = c(NT="#0000FF55",
                TP="#FF000055")[as.character(LUSC$meta$sample_type)])

##-------------------------------------------------------
## Task: plot PCA for the complete dataset
url = "http://edu.modas.lu/data/rda/LUSC.RData"

More at http://edu.modas.lu/modas_eda/part3.html
2.3. Clustering

Clustering is unsupervised machine learning.
Distance is measured from one point (set of coordinates) to another.
2.3. Clustering: k-means
Notes: k-means gives an exact partition (clusters do not overlap at all). It is robust but stiff, and k has to be defined in advance.

k-Means Clustering
k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which
each observation belongs to the cluster with the nearest mean.

1) k initial "means" (in this case k=3) are randomly selected from the data set (shown in color).
2) k clusters are created by associating every observation with the nearest mean.
3) The centroid of each of the k clusters becomes the new mean.
4) Steps 2 and 3 are repeated until convergence has been reached.

Source: http://wikipedia.org
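The four steps above can be written out as a short function. This is a bare-bones sketch for illustration only (the name `simple_kmeans` is hypothetical); for real analyses use the built-in `kmeans()`, which adds multiple starts and better algorithms.

```r
## Minimal k-means sketch following steps 1-4 above
simple_kmeans = function(x, k, max_iter = 100) {
  x = as.matrix(x)
  ## 1) k observations are randomly selected as initial means
  centers = x[sample(nrow(x), k), , drop = FALSE]
  cluster = rep(0L, nrow(x))
  for (iter in 1:max_iter) {
    ## 2) associate every observation with the nearest mean
    ##    (d[i, j] = squared Euclidean distance of observation i to mean j)
    d = sapply(1:k, function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster = max.col(-d, ties.method = "first")
    ## 4) stop when the assignment no longer changes (convergence)
    if (all(new_cluster == cluster)) break
    cluster = new_cluster
    ## 3) the centroid of each cluster becomes the new mean
    for (j in 1:k)
      if (any(cluster == j))
        centers[j, ] = colMeans(x[cluster == j, , drop = FALSE])
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)
res = simple_kmeans(iris[, -5], k = 3)
table(res$cluster, iris$Species)
```

Because the initial means are random, different seeds can give different partitions; this is why `kmeans()` is usually called with `nstart` greater than 1.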
2.3. Clustering: k-means
## here we will use previously calculated X from LUSC60
load("LUSC60.RData")
X = log2(1+LUSC$counts)
X = X[apply(X,1,var)>0,]

## k-means clustering
clusters = kmeans(x=t(X),centers=2,nstart=10)$cluster
## validate clusters
table(clusters,LUSC$meta$sample_type)
## get PCA results (use old if you have)
PC = prcomp(t(X),scale=TRUE)
## visualize as PCA, use colors to represent clusters
plot(PC$x[,1],PC$x[,2],col = clusters, pch=19, cex=2,
main="PCA, colored by k-means clusters")

## Exercise: do the same with standard `iris` data

View(iris)

## k-means clustering
clusters = kmeans(iris[,-5],centers=3,nstart=10)$cluster

## validate clusters
table(clusters,iris[,5])
## get PCA results (use old if you have)
PC = prcomp(iris[,-5])

## visualize as PCA, use colors to represent clusters
plot(PC$x[,1],PC$x[,2],col = clusters, pch=19, cex=2,
     main="PCA, colored by k-means clusters")

Notes: clustering is not always optimal; here two of the clusters (pink and green) look like they overlap. We have to know the number of clusters to interpret the result; without context it could be misunderstood.
2.3. Clustering: hierarchical
Note: hierarchical clustering is unstable when the data are changed, but easy to visualize.

Hierarchical Clustering
Hierarchical clustering creates a hierarchy of clusters that may be represented in a tree structure called a dendrogram.
The root of the tree consists of a single cluster containing all observations, and the leaves correspond to individual
observations.
Algorithms for hierarchical clustering are generally either agglomerative, in which one starts at the leaves and successively
merges clusters together; or divisive, in which one starts at the root and recursively splits the clusters.

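[Figure: dendrogram built over a set of elements. Agglomerative clustering proceeds from the leaves to the root, divisive clustering from the root to the leaves. Distance: Euclidean. Source: http://wikipedia.org]

Notes: agglomerative clustering starts with individual objects and groups them together. We can choose where to cut the "tree".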
2.3. Clustering: hierarchical
## here we will use previously calculated X from LUSC60

## hierarchical clustering – generate tree and show it
## dist() calculates the distances between samples
hc = hclust(dist(t(X)))
plot(hc)

## cut the tree to have k clusters
## (advantage: we can ask for any number of clusters after the tree is built)
clusters = cutree(hc, k=3)
table(clusters, LUSC$meta$sample_type)

## visualize as PCA, use colors to represent clusters (after cutting the tree)
PC = prcomp(t(X),scale=TRUE)
plot(PC$x[,1],PC$x[,2],col = clusters, pch=19, cex=2,
     main = "PCA, colored by hierarchical clusters")

## Exercise: do the same with `iris` data
View(iris)
...
2.4. Heatmaps
A heatmap presents 3D data: sample, gene, and the expression value shown as color.

## Heatmaps. Visualize the most variable genes in LUSC60
#install.packages("pheatmap")
library(pheatmap)

## identify the most variable genes
plot(density(apply(X,1,sd)))
ikeep = apply(X,1,sd)>3
table(ikeep)

## draw a heatmap
## scaling is important because genes are expressed at different levels
pheatmap(X[ikeep,],scale="row",fontsize_row=1, fontsize_col=5,
         main="Top variable genes")

## Exercise: try scale="none", select 1000+ variable genes

Notes: if we forget to scale, the heatmap is hard to understand visually. The color scale runs from lowly expressed to highly expressed. NT: normal tissue, TP: tumor primary.
2.4. Heatmaps: correlation for QC
## Heatmaps of correlation can be used for QC
library(pheatmap)

## calculate correlation between samples (columns)

R = cor(X, method="pearson")

## draw a heatmap of correlation
pheatmap(R,fontsize_row=5, fontsize_col=5,
         main="Correlation between samples")

Correlation between samples can be used to identify outliers or swap mistakes in the experiment. In case you have outliers or work with non-log-transformed data, you could use method = "spearman" (a non-parametric correlation) instead of Pearson.

Note: see the average correlation between samples ~0.9. Why?
The correlation will always be high, because some genes are always highly expressed and others are always lowly expressed, in every sample.
The correlation is close to 0 when the differences between two patients' samples are compared instead.
Cancer tissue samples are less similar to each other than normal tissue (NT) samples.
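The high baseline correlation can be reproduced with simulated data. In this sketch (all numbers hypothetical) every gene has its own expression level shared by all samples, plus independent noise per sample; there is no biological difference between samples at all.

```r
set.seed(1)
## ASSUMPTION: 2000 simulated genes, 6 samples, no true differences
gene_level = rnorm(2000, mean = 8, sd = 3)   # gene-specific level
X_sim = sapply(1:6, function(j) gene_level + rnorm(2000, sd = 1))

R_raw      = cor(X_sim)                      # correlation of expression
R_centered = cor(X_sim - rowMeans(X_sim))    # after removing gene levels

mean(R_raw[upper.tri(R_raw)])
mean(R_centered[upper.tri(R_centered)])
```

The first value is about 0.9 purely because genes differ in their overall level; once the gene-wise mean is removed, the correlation between samples disappears.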
Take Home Messages 2

Always check the distribution of your data! It can help you decide about pre-processing
(log-transformation, normalization) and identify outliers.

Use PCA and correlation to identify outliers or strangely behaving samples. PCA can also
show you the effects of experimental factors

Use clustering to group your data (unsupervised approach)

k-means method is very robust but you should know the number of clusters k.
Hierarchical clustering is quite flexible (k is variable) but not stable in case you exclude, add or change a few samples.

Heatmap is a nice tool to visualize the expression of genes over the samples.
Use a heatmap of correlations to check similarities and groups in your samples.
Part 3

3. Statistical Basics
http://edu.modas.lu/transcript-seq/part3.html

see more here: http://edu.modas.lu/modas_dea/index.html

3. Statistical Basics
Questions
Which genes have changes in mean expression level between conditions?
How reliable are these observations (what is your p-value or FDR)?

Differential Expression Analysis (DEA)
Single factor, two conditions: similar to t-test with Student's statistics (compare means).
Multifactor or multicondition: similar to ANOVA with Fisher's statistics (compare variances), followed by post-hoc analysis. Example: two cell lines in time.

What are those?..
• hypotheses
• p-values
• FDR
• t-test
• ANOVA
And do not forget about multiple hypotheses testing.
3.1. Hypothesis Testing
When statisticians would like to make a claim, they do this in the form of hypothesis testing. In
hypothesis testing, we begin by making a tentative assumption about a population parameter, i.e. by
formulating a null hypothesis.

Null hypothesis
The hypothesis tentatively assumed true in the hypothesis testing procedure, H0 .
For safety reasons, we assume a situation when nothing “interesting” happens as H0

Alternative hypothesis
The hypothesis concluded to be true if the null hypothesis is rejected, Ha
Ha will be a situation when we see something unusual, which requires action

Hypotheses in the simplest case: comparing a mean to a constant

One-tailed (upper): H0: μ ≤ const, Ha: μ > const
One-tailed (lower): H0: μ ≥ const, Ha: μ < const
Two-tailed:         H0: μ = const, Ha: μ ≠ const
3.1. Hypothesis Testing

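[Figure: conclusion of the test vs. the true state of the population]
False Negative, β error (type II): H0 is accepted although it is false.
False Positive, α error (type I): H0 is rejected although it is true.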
3.1. Hypothesis Testing: p-value
One-tailed test
A hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in one tail of its
sampling distribution

H0: μ ≥ μ0
Ha: μ < μ0
A Trade Commission (TC) periodically conducts statistical studies designed to test the claims that
manufacturers make about their products. For example, the label on a large can of Hilltop Coffee
states that the can contains 3 pounds of coffee. The TC knows that Hilltop's production process
cannot place exactly 3 pounds of coffee in each can, even if the mean filling weight for the population
of all cans filled is 3 pounds per can. However, as long as the population mean filling weight is at least
3 pounds per can, the rights of consumers will be protected. Thus, the TC interprets the label
information on a large can of coffee as a claim by Hilltop that the population mean filling weight is at
least 3 pounds per can. We will show how the TC can check Hilltop's claim by conducting a lower tail
hypothesis test.

Suppose a sample of n = 36 coffee cans is selected. From the previous studies, it's
known that σ = 0.18 lbm. The claimed mean is μ0 = 3 lbm.
3.1. Hypothesis Testing: p-value
μ0 = 3 lbm. Suppose a sample of n = 36 coffee cans is selected and m = 2.92 is
observed. From the previous studies, it's known that σ = 0.18 lbm.

H0: μ ≥ 3 => no action
Ha: μ < 3 => legal action

Let's say: in the extreme case, when μ = 3, we would like to be 99% sure that we make
no mistake when starting legal action against Hilltop Coffee. It means that the selected
significance level is α = 0.01.

[Figure: sampling distribution of m with two regions labelled "should be tested" and "OK"]
3.1. Hypothesis Testing: p-value
Let's find the probability of observing m for all possible μ ≥ 3. We start from the extreme case
(μ = 3) and then probe all possible μ > 3. See the behavior of the small probability area around
the measured m. What will you get if you sum its area over all possible μ ≥ 3?
## Calculate p-value: one sample mean
## (the p-value is the risk of making a mistake when rejecting H0)
## parameters
n = 36
mu = 3

## calculation (if no data available)

m = 2.92 # m – mean from experiment
sigma = 0.18 # st.dev. known beforehand
z = (m-mu)/sigma * sqrt(n)
pnorm(z) # 0.003830381

## calculation (if data is available)

url="http://edu.modas.lu/data/txt/coffee.txt"
x = scan(url)
t.test(x, mu = mu, alternative="less")

[Figure: sampling distribution with the observed mean m marked below μ0]
P(m) for all possible μ ≥ μ0 is equal to P(x < m) for the extreme case of μ = μ0
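For reference, the arithmetic behind the `pnorm(z)` call above can be written out with the numbers of the coffee example (n = 36, m = 2.92, σ = 0.18, μ0 = 3); Φ denotes the standard normal cumulative distribution function.

```latex
z = \frac{m - \mu_0}{\sigma / \sqrt{n}}
  = \frac{2.92 - 3}{0.18 / \sqrt{36}}
  = \frac{-0.08}{0.03}
  \approx -2.67,
\qquad
p = \Phi(z) \approx 0.0038 < \alpha = 0.01
```

Since the p-value is below the chosen significance level, H0 (μ ≥ 3) would be rejected in this example.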
3.2. Hypothesis Testing: Multiple Testing
## Why do we need multiple testing correction?

## 1. Generate a random matrix: 1000 genes x 6 samples

X = matrix(rnorm(6*1000),nrow=1000,ncol=6)
rownames(X) = paste0("gene",1:1000)

## 2. Assume col 1,2,3 - exp, 4,5,6 - ctrl

colnames(X) = c("exp1","exp2","exp3","ctrl1","ctrl2","ctrl3")

## 3. Do a t.test for each "gene" (slow, but who cares :)

pv = NULL
for (i in 1:nrow(X))
pv[i] = t.test(X[i,1:3],X[i,4:6])$p.value

table(pv < 0.05) # around 50 false positives are expected

## do FDR adjustment
fdr = p.adjust(pv,"fdr")
table(fdr < 0.05)

3.2. Multiple Testing

[Figure: conclusion of the test vs. the true state of the population]
False Negative, β error
False Positive, α error

Probability of an error in a multiple test, when α = 0.05: 1 – (0.95)^(number of comparisons)
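The formula above is easy to evaluate for a few numbers of comparisons. This sketch assumes that the tests are independent; the chosen numbers of comparisons are arbitrary.

```r
alpha = 0.05
m = c(1, 3, 10, 100)          # number of comparisons (arbitrary examples)
p_any = 1 - (1 - alpha)^m     # probability of at least one false positive
round(p_any, 3)
```

With 10 comparisons the chance of at least one false positive is already about 40%, and with 100 comparisons it is almost certain.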
3.2. Multiple Testing: FDR

False discovery rate (FDR)

FDR control is a statistical method used in multiple hypothesis testing to correct for multiple
comparisons. In a list of rejected hypotheses, FDR controls the expected proportion of incorrectly
rejected null hypotheses (type I errors).

                                         Population condition
Conclusion                      H0 is TRUE   H0 is FALSE   Total
Accept H0 (non-significant)         U             T        m – R
Reject H0 (significant)             V             S          R
Total                               m0         m – m0        m

$\mathrm{FDR} = E\left[\frac{V}{V+S}\right]$
3.2. Multiple Testing: FDR and FWER
False Discovery Rate: Benjamini & Hochberg
Assume we need to perform m = 100 comparisons, and select maximum FDR = α = 0.05.

$\mathrm{FDR} = E\left[\frac{V}{V+S}\right]$

Expected value for FDR < α if $P_{(k)} \le \frac{k}{m}\alpha$, i.e. $\frac{m}{k} P_{(k)} \le \alpha$
Theoretically, the sign should be "≤". But for practical reasons it is replaced by "<".
General usage: p.adjust(pv, method="fdr")

Familywise Error Rate (FWER) (not recommended)

Bonferroni – simple, but too stringent, not recommended: $m P_{(k)} \le \alpha$, where $m P_{(k)}$ is the adjusted p-value

Holm-Bonferroni – a more powerful, less stringent but still universal FWER: $(m + 1 - k) P_{(k)} \le \alpha$
Use it when you need to be sure about the best genes: p.adjust(pv, method="holm")
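The adjusted p-values from the inequalities above can be computed by hand and compared with `p.adjust()`. The p-values below are made up for illustration; P(k) denotes the k-th smallest of the m p-values.

```r
## ASSUMPTION: ten hypothetical p-values
pv = c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.36)
m  = length(pv)
o  = order(pv)      # ranks p-values from smallest to largest
k  = 1:m

## Bonferroni: m * P
bonf = pmin(1, pv * m)
## Holm: (m + 1 - k) * P(k), forced to be non-decreasing in k
holm = pmin(1, cummax(pv[o] * (m + 1 - k)))[order(o)]
## Benjamini-Hochberg: m / k * P(k), forced to be non-decreasing in k
bh   = pmin(1, rev(cummin(rev(pv[o] * m / k))))[order(o)]

round(cbind(pv, bonf, holm, bh), 3)
all.equal(bh,   p.adjust(pv, method = "fdr"))
all.equal(holm, p.adjust(pv, method = "holm"))
```

For every p-value, the BH-adjusted value is the smallest and Bonferroni the largest, which reflects the ordering in stringency described above.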
3.3. Linear Models

Many conditions
We have measurements for 5 conditions. Are the means for these conditions equal?
If we would use pairwise comparisons, what will be the probability of getting an error?
Number of comparisons: $C_5^2 = \frac{5!}{2!\,3!} = 10$
Probability of an error: 1 – (0.95)^10 = 0.4

Many factors
We assume that we have several factors affecting our data. Which factors are most significant? Which can be neglected?

=> ANOVA

[Figure: example from Partek™]
3.3. Linear Models: ANOVA

As part of a long-term study of individuals 65 years of age or older, sociologists and physicians at the Wentworth Medical Center
in upstate New York investigated the relationship between geographic location and depression. A sample of 60 individuals, all in
reasonably good health, was selected; 20 individuals were residents of Florida, 20 were residents of New York, and 20 were
residents of North Carolina. Each of the individuals sampled was given a standardized test to measure depression. The data
collected follow; higher test scores indicate higher levels of depression.
Q: Is the depression level same in all 3 locations?

H0: μ1 = μ2 = μ3
Ha: not all 3 means are equal

depression.txt
1. Good health respondents
Florida   New York   N. Carolina
3         8          10
7         11         7
7         9          3
3         7          5
8         8          11
8         7          8
…         …          …
3.3. Linear Models: ANOVA

H0: μ1 = μ2 = μ3
Ha: not all 3 means are equal

[Figure: depression level (0 to 14) for the individual measures in FL, NY and NC, with group means m1, m2 and m3]

Please see the code and explanation online:
http://edu.modas.lu/transcript-seq/part3.html

## load data (*)
Dep = read.table("depression2.txt",
                 header=TRUE, sep="\t", as.is=FALSE)
str(Dep)

## run 1-factor ANOVA
DepGH = Dep[Dep$Health == "good",]
res1 = aov(Depression ~ Location, DepGH)
summary(res1)
TukeyHSD(res1)

## run 2-factor ANOVA
res2 = aov(Depression ~ Location + Health + Location*Health, Dep)
summary(res2)
TukeyHSD(res2)

(*) http://edu.modas.lu/data/txt/depression2.txt
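To see what `aov()` reports, the F statistic of a 1-factor ANOVA can be computed by hand as the ratio of between-group to within-group variance. The sketch below uses simulated data with hypothetical group means, not the depression data set.

```r
set.seed(1)
## ASSUMPTION: three simulated groups of 20 observations each
d = data.frame(group = factor(rep(c("A", "B", "C"), each = 20)),
               y = c(rnorm(20, mean = 5), rnorm(20, mean = 8), rnorm(20, mean = 7)))

k = nlevels(d$group)            # number of groups
N = nrow(d)                     # total number of observations
group_mean = ave(d$y, d$group)  # mean of its own group, for each observation

ss_between = sum((group_mean - mean(d$y))^2)   # variation of group means
ss_within  = sum((d$y - group_mean)^2)         # variation inside groups

F_manual = (ss_between / (k - 1)) / (ss_within / (N - k))
p_manual = pf(F_manual, k - 1, N - k, lower.tail = FALSE)

## compare with aov()
tab = summary(aov(y ~ group, d))[[1]]
c(F_manual = F_manual, F_aov = tab[["F value"]][1])
c(p_manual = p_manual, p_aov = tab[["Pr(>F)"]][1])
```

A large F means that the group means differ by more than the scatter inside the groups would explain, which is the sense in which ANOVA "compares variances".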
Take Home Messages 3

When doing multiple hypothesis testing and selecting only those elements
which are significant – always use FDR (or other, like FWER) correction! (Note for exams.)

The simplest correction – multiply the p-value by the number of genes. Is it still
significant? Use FDR (Benjamini-Hochberg) or FWER (Holm)

DEA detects the genes which have changed mean gene expression
between conditions
=> The more data you have, the smaller the differences you will be able to see

Several factors can be taken into account in ANOVA approach. This will give
you insight into the significance of each experimental factor but at the same
time will correct batch effects and allow you to answer complex questions
(remember shoes affecting ladies…).

Part 4

4. Statistics for RNA-seq

http://edu.modas.lu/transcript-seq/part4.html

see more here: http://edu.modas.lu/modas_dea/index.html

4.1. Linear Models for Transcriptomics Data

$Y_{ij} = \mu_i + A_j + B_j + A_j * B_j + \epsilon_{ij}$

i – gene index
j – sample index
µi – average expression of gene i
Aj∗Bj – effect which cannot be explained by the superposition of A and B

limma – R package for DEA in microarrays or RNA-seq based on linear models. It is less sensitive but fast.
It is similar to t-test / ANOVA but uses all available data for variance estimation, thus it has
higher power when the number of replicates is limited. It assumes a normal distribution of
values for the gene between replicates. Apply it to normalized, log-transformed counts.

edgeR – R package for DEA in RNA-Seq, based on linear models and negative binomial
distribution of counts. Generally, apply it to raw counts!
Better noise model results in higher power detecting differentially expressed genes. It assumes
a negative-binomial distribution of values for the gene between replicates.

DESeq2 – another R package for DEA in RNA-Seq, based on the negative binomial distribution
of counts. DESeq2 is the most sensitive of the three. Apply it to raw counts!
Better noise model results in higher power detecting differentially expressed genes. It assumes
a negative-binomial distribution of values for the gene between replicates.
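For a single gene, the model above is an ordinary linear model with two factors and their interaction. The sketch below uses base R `lm()` on simulated data (all effect sizes are hypothetical); limma, edgeR and DESeq2 fit the same kind of design gene by gene with better variance or count models.

```r
set.seed(1)
## ASSUMPTION: simulated expression of one gene, 2 x 2 design, 40 replicates
d = expand.grid(rep = 1:40, A = c("a0", "a1"), B = c("b0", "b1"))
d$y = 10 +                                  # average expression (mu)
      1.0 * (d$A == "a1") +                 # effect of factor A
      0.5 * (d$B == "b1") +                 # effect of factor B
      2.0 * (d$A == "a1" & d$B == "b1") +   # interaction A*B
      rnorm(nrow(d), sd = 0.5)              # noise (epsilon)

fit = lm(y ~ A * B, data = d)   # A * B expands to A + B + A:B
round(coef(fit), 2)
anova(fit)
```

The coefficient `Aa1:Bb1` estimates the part of the response that cannot be explained by adding the separate effects of A and B.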
4.2. DEA: Preparing for limma, edgeR, DESeq2

## install packages
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("limma")
BiocManager::install("edgeR")
BiocManager::install("DESeq2")

## if you wish, you can use my simple wrappers
source("http://r.modas.lu/LibDEA.r")
DEA.limma
DEA.edgeR
DEA.DESeq

Here we should define contrasts: "condition1 – condition2"
condition 1 – experimental group
condition 2 – control group
4.2. DEA: Time Series Experiment
Experiment: A375 cells stimulated by IFNg

## Let's use limma for a time-series experiment
## load the data that are in annotated text format
## (AMD: Annotation – Metadata – Data format)
source("http://r.modas.lu/readAMD.r")
mRNA = readAMD("http://edu.modas.lu/data/txt/mrna_ifng.amd.txt",
               stringsAsFactors=TRUE,
               index.column="GeneSymbol",
               sum.func="mean")
str(mRNA)

## attach library with wrapper functions
source("http://r.modas.lu/LibDEA.r")
library(pheatmap)

## DEA: the most variable genes (by F-statistics)
ResF = DEA.limma(data = mRNA$X, group = mRNA$meta$time)
genes = order(ResF$FDR)[1:100] ## select top 100 genes
pheatmap(mRNA$X[genes,], cluster_col=FALSE, scale="row",
         fontsize_row=2, fontsize_col=10, cellwidth=15,
         main="Top 100 significant genes (F-stat)")

## DEA: genes differentially expressed (by moderated t-test)
Res24 = DEA.limma(data = mRNA$X,
                  group = mRNA$meta$time,
                  key0="T00",key1="T24")
## volcano plot
plotVolcano(Res24,thr.fdr=0.01,thr.lfc=1)
genes = order(Res24$FDR)[1:100] ## select top 100 genes
samples = grep("T00|T24",mRNA$meta$time) ## select T00,T24 samples
pheatmap(mRNA$X[genes,samples],cluster_col=FALSE,scale="row",
         fontsize_row=2, fontsize_col=10, cellwidth=15,
         main="Top 100 significant genes T24-T00 (moderated t-stat)")

See more at http://edu.modas.lu/modas_dea/part3.html
4.2. DEA: Time Series Experiment

## Save results

## save the most variable genes (by F-statistics)

write.table(ResF[ResF$FDR<0.0001,],file = "DEA_F.txt",
col.names=NA, sep="\t", quote=FALSE)
## save significant genes T24-vs-T00
write.table(Res24[Res24$FDR<0.001 & abs(Res24$logFC)>1,],
file = "DEA_T24-T00.txt",
col.names=NA, sep="\t", quote=FALSE)
## save gene list (response at 24 h of IFNg treatment)
write(Res24[Res24$FDR<0.0001,1],file="genes24.txt")

Please, investigate the results. Submit any list to the functional annotation tool Enrichr:
https://maayanlab.cloud/Enrichr/
It performs enrichment analysis for lists of up to thousands of genes and tells us their functions.
4.3. Functional Annotation: Enrichment

Are interesting genes over-represented in a subset corresponding to some biological process?

[Figure: panels B and C showing colored balls grabbed from a box]
B: How surprised will you be if he grabbed ●●●●●●●●●●●●●●●●●●●●?
C: No enrichment in C (17 red, 3 green)

Method of the analysis: Fisher's exact test
4.3. Gene Over-representation Analysis
Fisher's exact test: based on hypergeometric distributions
Hypergeometric: distribution of objects taken from a "box", without putting them back

$C_n^k = \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$
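The over-representation question can be answered directly with the hypergeometric distribution or, equivalently, with a one-sided Fisher's exact test. All numbers in this sketch are hypothetical.

```r
## ASSUMPTION: hypothetical numbers, not from the course data
N = 20000   # all genes (the "box")
K = 200     # genes annotated to the biological process
n = 100     # interesting (significant) genes that were "grabbed"
k = 10      # interesting genes that belong to the process

## expected overlap by chance
n * K / N

## P(overlap >= k) when drawing n genes without putting them back
p_hyper = phyper(k - 1, K, N - K, n, lower.tail = FALSE)

## the same as a one-sided Fisher's exact test on a 2 x 2 table
tab = matrix(c(k, n - k, K - k, N - K - n + k), nrow = 2,
             dimnames = list(c("in process", "not in process"),
                             c("interesting", "other")))
p_fisher = fisher.test(tab, alternative = "greater")$p.value

c(hypergeometric = p_hyper, fisher = p_fisher)
```

Only 1 gene is expected in the overlap by chance, so observing 10 gives a very small p-value: the category is over-represented among the interesting genes.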
4.3. Gene Set Enrichment Analysis
Is the direction of all genes in a category random?

Note: GSEA is the last resort for biologists, as it does not need any individual gene of interest to be significant.
Take Home Messages 4

If you are looking at a multi-factor / multi-treatment experiment, you may check the
variable genes (F-statistics based) first, and then go for the contrasts.

To find the biological meaning of the significantly regulated genes, please use
enrichment analysis methods linking known functional groups of genes to DEA results.

Enriched categories are usually more robust than individual genes. If you have no
significant genes – check gene sets by GSEA.

Enrichr: https://maayanlab.cloud/Enrichr/
DAVID: https://david.ncifcrf.gov/
Reactome: https://reactome.org/
STRING: https://string-db.org/
WikiPathways: https://wikipathways.org/
Summary

[Figure: workflow of the analysis]

Raw Data
=> QA/QC + remove outliers
=> Normalization (remove technical artefacts, make data comparable)
=> Filtering (remove uninformative features)
=> Processed Data

From Processed Data:
- Visualization and exploratory analysis (PCA, clustering)
- DEA (differential expression analysis) => Enrichment (GO, functions, TFs, drugs)
- GSEA (gene set enrichment analysis)
- Prediction (signatures for classification)
- Network reconstruction (not considered)

3_RNAseq_background
No ratings yet
3_RNAseq_background
42 pages
NOISeq
No ratings yet
NOISeq
26 pages
CLC Genomics Workbench User Manual Subset
No ratings yet
CLC Genomics Workbench User Manual Subset
222 pages
Intro To Pneumatics Modified
No ratings yet
Intro To Pneumatics Modified
35 pages
Bioinformatics Experimental Design
No ratings yet
Bioinformatics Experimental Design
6 pages
NGS Data Analysis
No ratings yet
NGS Data Analysis
4 pages
Bioinfo Course Notes M1 2020 Dr Mbulli
No ratings yet
Bioinfo Course Notes M1 2020 Dr Mbulli
56 pages
Using Limma For Microarray and RNA-Seq Analysis
No ratings yet
Using Limma For Microarray and RNA-Seq Analysis
13 pages
IBB.MB.501 RNA-seq + introduction to galaxy
No ratings yet
IBB.MB.501 RNA-seq + introduction to galaxy
34 pages
Introduction To Bioinformatics
No ratings yet
Introduction To Bioinformatics
14 pages
s13059-020-1949-z
No ratings yet
s13059-020-1949-z
19 pages
Beginner's Guide To Using The DESeq2 Package
No ratings yet
Beginner's Guide To Using The DESeq2 Package
32 pages
Lab2
No ratings yet
Lab2
7 pages
Tutorial RNA-Seq Analysis Part 1
No ratings yet
Tutorial RNA-Seq Analysis Part 1
8 pages
R Tutorial
No ratings yet
R Tutorial
3 pages
Glossary of Terms B4B
No ratings yet
Glossary of Terms B4B
8 pages
combined-16-30
No ratings yet
combined-16-30
15 pages
3 RNAseq-Mapping LO
No ratings yet
3 RNAseq-Mapping LO
98 pages
4 RNAseq-Quantification LO
No ratings yet
4 RNAseq-Quantification LO
30 pages
Rcourse_partViz
No ratings yet
Rcourse_partViz
9 pages
PDxNucleus Brochure
No ratings yet
PDxNucleus Brochure
17 pages
R-Language-in-Bioinformatics
No ratings yet
R-Language-in-Bioinformatics
2 pages
NGS ToolsFormats r1 BDG
No ratings yet
NGS ToolsFormats r1 BDG
32 pages
Lab 8 Homepage
No ratings yet
Lab 8 Homepage
4 pages
Bulk Analyse R
No ratings yet
Bulk Analyse R
7 pages
Nanoscale CMOS: Innovative Materials, Modeling and Characterization
From Everand
Nanoscale CMOS: Innovative Materials, Modeling and Characterization
Francis Balestra
No ratings yet
LEARN MPLS FROM SCRATCH PART-B: A Beginners guide to next level of networking
From Everand
LEARN MPLS FROM SCRATCH PART-B: A Beginners guide to next level of networking
POONAM DEVI
No ratings yet
PLC: Programmable Logic Controller – Arktika.: EXPERIMENTAL PRODUCT BASED ON CPLD.
From Everand
PLC: Programmable Logic Controller – Arktika.: EXPERIMENTAL PRODUCT BASED ON CPLD.
MARIO FRANCO
No ratings yet
A Semi-Detailed Lesson Plan in Mathematics 4
No ratings yet
A Semi-Detailed Lesson Plan in Mathematics 4
5 pages
sft2841 Manual PDF
50% (2)
sft2841 Manual PDF
4 pages
Slab Detail 07
No ratings yet
Slab Detail 07
1 page
BLOCK 3 Extra Bar
No ratings yet
BLOCK 3 Extra Bar
98 pages
Von Mises
No ratings yet
Von Mises
8 pages
Ansys Fluent 12.0 Text Command List: February 2009
No ratings yet
Ansys Fluent 12.0 Text Command List: February 2009
79 pages
Light
No ratings yet
Light
21 pages
20210531-8012 Form71420201200253
No ratings yet
20210531-8012 Form71420201200253
33 pages
'U1 Gauge Invariance
No ratings yet
'U1 Gauge Invariance
6 pages
Bearing Failure
100% (1)
Bearing Failure
22 pages
English Grammar
No ratings yet
English Grammar
29 pages
Hashing
No ratings yet
Hashing
14 pages
Control Lab PDF
No ratings yet
Control Lab PDF
21 pages
Kubernetes
No ratings yet
Kubernetes
17 pages
New Wordpad Document
No ratings yet
New Wordpad Document
10 pages
Data Warehousing
No ratings yet
Data Warehousing
8 pages
Bizhub PRO 1200 Series Product Guide 4.8
No ratings yet
Bizhub PRO 1200 Series Product Guide 4.8
73 pages
Calculating Equivalent Resistance, Total Current and Total
100% (1)
Calculating Equivalent Resistance, Total Current and Total
39 pages
Pre-Placements Checklist: Data Structures
No ratings yet
Pre-Placements Checklist: Data Structures
5 pages
C3 Differentiation - Implicit Differentiation
No ratings yet
C3 Differentiation - Implicit Differentiation
4 pages
Ap Physics B Lesson 64 76 Fluid Mechanics
No ratings yet
Ap Physics B Lesson 64 76 Fluid Mechanics
58 pages
Cartography PDF
No ratings yet
Cartography PDF
20 pages
Fiocchetti Mnl013 (Ectf-Control-Dmlp Monitor) 2007-09
100% (2)
Fiocchetti Mnl013 (Ectf-Control-Dmlp Monitor) 2007-09
44 pages
425 ch12 Lec S12
No ratings yet
425 ch12 Lec S12
102 pages
O Level Physics Magnetism
No ratings yet
O Level Physics Magnetism
21 pages
CNC Course
No ratings yet
CNC Course
118 pages
Eng. Statics Lab Manual Spring 2024
No ratings yet
Eng. Statics Lab Manual Spring 2024
35 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.