
Petr Nazarov

MISB Course Transcriptomics (Prof. Dr. Stephanie Kreis) petr.nazarov@lih.lu


2023-10-24 http://edu.modas.lu/transcript-seq
Outline of the Course

1. Data overview
• RNA-seq data generation
• File formats, Phred quality
• Sequence-based QC: FastQC / MultiQC
• Statistical properties of the data

2. Exploratory data analysis
• Distributions & boxplots
• Dimensionality reduction: PCA, MDS, t-SNE, UMAP
• Clustering
• Heatmaps for expression and correlation
• Detection of outliers

3. Statistical basics
• Hypothesis testing (p-value)
• t-test, Wilcoxon test
• Multiple testing (FDR, FWER)
• Linear models: ANOVA

4. Statistics for RNA-seq
• Differential expression analysis
• edgeR, DESeq2, limma
• Enrichment analysis

Please see scripts and materials online http://edu.modas.lu/transcript-seq


Part 1

1. Data Overview
http://edu.modas.lu/transcript-seq/part1.html

1.1. RNA-seq Data Generation

read: short fragment detected by RNA-seq
library: collection of all reads from the sample
CPM: counts per million mapped reads
TPM: transcripts per million (a proportion)
FPKM: fragments per kilobase of exon per million reads mapped (paired-end)
RPKM: reads per kilobase of exon per million reads mapped (single-end)

Notation: Xi – observed number of reads for gene i; N – library size; li – length of the gene (transcript).
The pipeline goes from raw counts to normalized counts (CPM, FPKM, RPKM, TPM).

10-minute simple explanation of TPM / FPKM: https://www.youtube.com/watch?v=TTUrtCY2k-w
Wang Z et al. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009
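As an illustration (a minimal sketch, not from the slides): computing CPM, FPKM and TPM in base R, assuming a raw count matrix counts (genes x samples) and a vector len of gene/transcript lengths in base pairs.

## counts: raw counts matrix (genes x samples); len: gene length in bp - both assumed inputs
cpm  = t( t(counts) / colSums(counts) ) * 1e6    # CPM_i  = X_i / N * 10^6
rpk  = counts / (len / 1000)                     # reads per kilobase: X_i / (l_i/1000)
fpkm = t( t(rpk) / colSums(counts) ) * 1e6       # FPKM_i = X_i / (l_i/1000) / (N/10^6)
tpm  = t( t(rpk) / colSums(rpk) ) * 1e6          # TPM_i: per-sample values sum to 10^6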
1.2. File Formats

Pipeline: raw image files (e.g. BCL) → FASTQ files → mapping/alignment → SAM/BAM files → counting → raw counts → normalized counts (CPM, TPM, RPKM, …)

FASTQ example:
@HWI-ST508:152:D06G9ACXX:2:1101:1160:2042 1:Y:0:ATCACG
NAAGACCGAATTCTCCAAGCTATGGTAAACATTGCACTGGCCTTTCATCTG
+
#11??+2<<<CCB4AC?32@+1@AB1**1?AB<4=4>=BB<9=>?######

SAM/BAM example:
@HD VN:1.0 SO:coordinate
@SQ SN:seq1 LN:5000
@SQ SN:seq2 LN:5000
@CO Example of SAM/BAM file format.
B7_591:4:96:693:509 73 seq1 1 99 36M * 0 0 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG <<<<<<<<<<<<<<<;<<<<<<<<<5<<<<<;:<;7 MF:i:18 Aq:i:73 NM:i:0 UQ:i:0 H0:i:1 H1:i:0
EAS54_65:7:152:368:113 73 seq1 3 99 35M * 0 0 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT <<<<<<<<<<0<<<<655<<7<<<:9<<3/:<6): MF:i:18 Aq:i:66 NM:i:0 UQ:i:0 H0:i:1 H1:i:0

Raw counts / normalized counts (CPM, TPM, RPKM, …) example:
ID               Gene.Symbol   A1   A2   A3   A4   B1   B2
ENSG00000135899  SP110         32   31   33   33  136  136
ENSG00000154451  GBP5           0    0    0    0  395  383
ENSG00000226025  LGALS17A       0    0    0    0  217  196
ENSG00000213512  GBP7           0    0    0    0   44   47
ENSG00000260873  SNTB2        198  193  195  196  483  502
ENSG00000063046  EIF4B        552  546  548  550  428  429
ENSG00000102524  TNFSF13B       0    0    0    0   16   17

Advantage of RNA-seq: you can repeat the pipeline with new knowledge or new questions.
1.2. File Formats
@HWI-ST508:152:D06G9ACXX:2:1101:1160:2042 1:Y:0:ATCACG
NAAGACCGAATTCTCCAAGCTATGGTAAACATTGCACTGGCCTTTCATCTG
+
#11??+2<<<CCB4AC?32@+1@AB1**1?AB<4=4>=BB<9=>?######

Quality scores started as numbers (0–40) but have since changed to an ASCII encoding to reduce file size and make working with this format a bit easier; they still hold the same information. Each character encodes the Phred score Q = −10·log10(P_error), stored as the ASCII character with code Q + 33 (the Phred+33 convention used by current Illumina FASTQ files). A lookup table of these ASCII codes can serve as a reference as you progress through your analysis.

https://learn.gencore.bio.nyu.edu/ngs-file-formats/quality-scores/
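A minimal sketch (not from the slide) of decoding the Phred+33 quality line of the FASTQ record above in base R:

qual  = "#11??+2<<<CCB4AC?32@+1@AB1**1?AB<4=4>=BB<9=>?######"
phred = utf8ToInt(qual) - 33      # ASCII code minus the +33 offset gives Q
perr  = 10^(-phred/10)            # error probability: P = 10^(-Q/10)
round(head(cbind(Q = phred, P = perr)), 4)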
1.3. Sequence-based QC: FastQC

FastQC – a simple but widely used Java-based tool for quality control of sequencing experiments at the sequence level. It provides a modular set of analyses which you can use to get a quick impression of whether your data have any problems you should be aware of before doing any further analysis.
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

• Import of data from BAM, SAM or FastQ files (any variant)
• A quick overview to tell you in which areas there may be problems
• Summary graphs and tables to quickly assess your data
• Export of results to an HTML-based permanent report
• Offline operation to allow automated generation of reports without running the interactive application

More detailed explanation & examples:
https://scienceparkstudygroup.github.io/rna-seq-lesson/03-qc-of-sequencing-results/index.html#31-running-fastqc
1.3. Sequence-based QC: MultiQC

MultiQC – a modular, Python-based tool to aggregate results from bioinformatics analyses across many samples into a single report.
https://multiqc.info/ – see example online.

Introduction: https://www.youtube.com/watch?v=BbScv9TcaMg
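A minimal sketch (not from the slides) of running both tools from R, assuming fastqc and multiqc are installed and on the PATH, and that the FASTQ files sit in a fastq/ folder:

fq = list.files("fastq", pattern="\\.fastq\\.gz$", full.names=TRUE)
dir.create("fastqc_out", showWarnings=FALSE)
system2("fastqc",  c(fq, "-o", "fastqc_out"))              # one report per FASTQ file
system2("multiqc", c("fastqc_out", "-o", "multiqc_out"))   # aggregate all reports into one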

1.4. Statistical Properties of the Data

ID               Gene.Symbol   A1   A2   A3   A4   B1   B2
ENSG00000135899  SP110         32   31   33   33  136  136
ENSG00000154451  GBP5           0    0    0    0  395  383
ENSG00000226025  LGALS17A       0    0    0    0  217  196
ENSG00000213512  GBP7           0    0    0    0   44   47
ENSG00000260873  SNTB2        198  193  195  196  483  502
ENSG00000063046  EIF4B        552  546  548  550  428  429
ENSG00000102524  TNFSF13B       0    0    0    0   16   17
ENSG00000107201  DDX58         79   81   82   77  296  310
ENSG00000010030  ETV7           2    2    2    0   93   85
ENSG00000125347  IRF1          22   24   27   22  234  236
ENSG00000180616  SSTR2          0    0    0    0   19   21
ENSG00000155962  CLIC2          2    2    1    1   71   65
ENSG00000153944  MSI2          55   54   54   54   37   37
ENSG00000197646  PDCD1LG2       0    0    0    0   58   60
ENSG00000108771  DHX58          5    4    4    5   26   25
ENSG00000100336  APOL4          9    8   11    8  130  135
ENSG00000182551  ADI1          88   86   88   89   59   60
ENSG00000128284  APOL3         14   14   14   13   85   94
ENSG00000153989  NUS1         214  216  212  214  167  167
ENSG00000131979  GCH1          57   61   57   56  172  167

(A1–A4 are replicates of the same condition: counts for the same gene vary only slightly between replicates.)

Which distribution describes such counts?
• Poisson distribution – 1 parameter to fit (λ): too simple!
• Negative binomial distribution – 2 parameters to fit (p, r): fits the biology better!
• Normal distribution – can be used for log(1+k) when the counts k are large, but it is approximate and has less power (still usable, but may miss interesting cases).
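A minimal simulation sketch (not from the slides) of why the negative binomial fits RNA-seq counts better than the Poisson: counts of a gene across biological replicates are typically overdispersed (variance larger than the mean).

set.seed(1)
pois = rpois(1000, lambda=100)          # Poisson: variance ~ mean
nb   = rnbinom(1000, mu=100, size=5)    # negative binomial: variance = mu + mu^2/size
c(mean(pois), var(pois))                # ~100, ~100
c(mean(nb),   var(nb))                  # ~100, ~2100 (overdispersed)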
1.5. Data Repositories

GEO: http://www.ncbi.nlm.nih.gov/gds – US-based repository of omics data
ArrayExpress: http://www.ebi.ac.uk/arrayexpress/ – EU-based repository of omics data
TCGA: https://tcga-data.nci.nih.gov/tcga/ – ~11k tumor samples; analysis via http://www.cbioportal.org/public-portal/
GTEx: https://www.gtexportal.org/home/ – ~17k healthy samples
Take Home Messages 1

RNA-seq data can be used as raw counts or as normalized values (TPM, FPKM). Check what a specific algorithm expects!

nf-core provides ready-made pipelines for the analysis of RNA-seq, metagenomics, ChIP-seq and Nanopore data.

For QC of your samples at the sequence level – use FastQC. To combine the results – MultiQC.

Expression-related data in transcriptomics are strongly right-skewed. Therefore:
• For statistics, use either a precise distribution (negative binomial for RNA-seq) or work with log-transformed data.
• Use log-transformed data for exploratory analysis and visualization.

Several large data repositories exist. Before planning your experiments, search for existing data.

(Note for the January exam: check the Word document with an overview of everything on MODAS.lu.)
Part 2

2. Exploratory Data
Analysis
http://edu.modas.lu/transcript-seq/part2.html

see more here: http://edu.modas.lu/modas_eda/


2.1. Distributions

## download and load the data (raw counts: noisy data directly from the machine)
url = "http://edu.modas.lu/data/rda/LUSC60.RData"
download.file(url, destfile="LUSC60.RData", mode="wb")
load("LUSC60.RData")
str(LUSC)

## log-transform the data and put it into X
X = log2(1+LUSC$counts)

## density plot
plot(density(X), col="blue", lwd=2)

## boxplot - a classic way to check whether the data look "normalized"
boxplot(X, col="lightblue", las=2)

## simple normalization (do not use it for real analysis ;) )
XN = scale(X)
boxplot(XN, col="lightblue", las=2)

## try this: per-sample densities on top of the overall density
plot(density(X), col="black", lwd=2)
for (i in 1:ncol(X))
    lines(density(X[,i]), col="#0000FF33")
2.2. Dimensionality Reduction

Each sample (object) is represented by ~20 000 genes (features). How can we visualize the samples in an understandable way? Use dimensionality reduction: look at the data from a different "angle" that captures most of the variability.

PCA – rotation of the coordinate system in multidimensional space so as to capture the main variability in the data (it follows the most variable directions).
ICA – matrix factorization method that identifies statistically independent signals and their weights.
NMF – matrix factorization method that presents the data as a product of two non-negative matrices.
LDA – identifies a new coordinate system maximizing the difference between objects belonging to predefined groups.
MDS – method that tries to preserve the distances between objects in the low-dimensional space.
AE – autoencoder: an artificial neural network with a "bottleneck".
t-SNE – an iterative approach, similar to MDS, but considering only close objects: similar objects must be close in the new (reduced) space, while distant objects do not influence the result.
UMAP – a modern method, similar to t-SNE, but more stable and preserving some information about distant groups (preserving the topology of the data).

Please check some nice interactive resources online:
• Principal Component Analysis Explained Visually by Victor Powell
• Understanding UMAP by Andy Coenen and Adam Pearce
• Dimensionality Reduction for Data Visualization: PCA vs TSNE vs UMAP vs LDA by Sivakar Sivarajah
2.2. Dimensionality Reduction: PCA

Principal component analysis (PCA) is a vector-space transform used to reduce multidimensional data sets to lower dimensions for analysis (20 000 genes → 2 dimensions). It selects the coordinates along which the variation of the data is largest.

For simplicity, let us consider a two-parameter situation, both in terms of the data and of the resulting PCA.

[Figure: the same points shown as a scatter plot in the "natural" coordinates (Variable 1 vs Variable 2) and in the PC coordinates (first vs second component).]

Instead of using the 2 "natural" parameters for classification, we can use the first component alone!
2.2. Dimensionality Reduction: PCA example 1

[Figure: PCA plots of real datasets showing, respectively, a technical (batch) effect, a biological effect and a time effect.]
2.2. Dimensionality Reduction: PCA example 2

PCA tends to capture the directions that differentiate samples more than those where samples are similar.

## download and load the data for TCGA
## lung squamous cell carcinoma patients
url = "http://edu.modas.lu/data/rda/LUSC60.RData"
download.file(url, destfile="LUSC60.RData", mode="wb")
load("LUSC60.RData")
str(LUSC)

## log-transform the data and put it into X
X = log2(1+LUSC$counts)

##----------------------------------------------------
## exclude genes with 0 variance
X = X[apply(X,1,var)>0,]

## Run PCA on the transposed X
PC = prcomp(t(X), scale=TRUE)
str(PC)

## Visualize: NT (non-tumor) in blue, TP (tumor primary) in red
plot(PC$x[,1], PC$x[,2], pch=19,
     col=c(NT="blue",TP="red")[LUSC$meta$sample_type])

##---= or use my wrap-up =---
source("http://r.modas.lu/plotPCA.r")
plotPCA(X, cex=1.5,
        col = c(NT="#0000FF55",
                TP="#FF000055")[LUSC$meta$sample_type])

##-------------------------------------------------------
## Task: plot PCA for the complete dataset
url = "http://edu.modas.lu/data/rda/LUSC.RData"

More at http://edu.modas.lu/modas_eda/part3.html
2.3. Clustering

Clustering is an unsupervised machine-learning approach: samples are grouped based on a distance measure computed between their coordinates in feature space.
2.3. Clustering: k-means

k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It produces a hard (non-overlapping) partition; it is robust but rigid, and k has to be defined in advance.

1) k initial "means" (in this case k = 3) are randomly selected from the data set.
2) k clusters are created by associating every observation with the nearest mean.
3) The centroid of each of the k clusters becomes the new mean.
4) Steps 2 and 3 are repeated until convergence has been reached.

http://wikipedia.org
2.3. Clustering: k-means

## here we will use previously calculated X from LUSC60
load("LUSC60.RData")
X = log2(1+LUSC$counts)
X = X[apply(X,1,var)>0,]

## k-means clustering
clusters = kmeans(x=t(X), centers=2, nstart=10)$cluster

## validate clusters
table(clusters, LUSC$meta$sample_type)

## get PCA results (reuse the old ones if you have them)
PC = prcomp(t(X), scale=TRUE)

## visualize as PCA, use colors to represent clusters
plot(PC$x[,1], PC$x[,2], col=clusters, pch=19, cex=2,
     main="PCA, colored by k-means clusters")

## Exercise: do the same with the standard `iris` data
View(iris)

## k-means clustering
clusters = kmeans(iris[,-5], centers=3, nstart=10)$cluster

## validate clusters
table(clusters, iris[,5])

## get PCA results
PC = prcomp(iris[,-5])

## visualize as PCA, use colors to represent clusters
plot(PC$x[,1], PC$x[,2], col=clusters, pch=19, cex=2,
     main="PCA, colored by k-means clusters")

Note: the clustering is not always optimal – in the iris example two clusters appear to overlap in the PCA plot. The number of clusters k must be known (or guessed) in advance; without that context the result can be misleading.
2.3. Clustering: hierarchical

Hierarchical clustering creates a hierarchy of clusters that may be represented in a tree structure called a dendrogram. The root of the tree consists of a single cluster containing all observations, and the leaves correspond to individual observations. It is easy to visualize, but unstable: the tree can change considerably when the data change.

Algorithms for hierarchical clustering are generally either agglomerative, in which one starts at the leaves (individual objects) and successively merges clusters together, or divisive, in which one starts at the root and recursively splits the clusters.

The number of clusters is chosen by deciding where to cut the "tree" (dendrogram).

[Figure: elements, agglomerative vs divisive merging, and the resulting dendrogram; distance: Euclidean. http://wikipedia.org]
2.3. Clustering: hierarchical

## here we will use previously calculated X from LUSC60

## hierarchical clustering – build the tree from distances between samples and show it
hc = hclust(dist(t(X)))
plot(hc)

## cut the tree to have k clusters
clusters = cutree(hc, k=3)
table(clusters, LUSC$meta$sample_type)

## visualize as PCA, use colors to represent clusters
PC = prcomp(t(X), scale=TRUE)
plot(PC$x[,1], PC$x[,2], col=clusters, pch=19, cex=2,
     main = "PCA, colored by hierarchical clusters")

## advantage: after cutting the tree you can request any number of clusters

## Exercise: do the same with `iris` data
View(iris)
...
2.4. Heatmaps

Heatmaps present expression data as a matrix (samples x genes) with the value encoded as color (a "third dimension").

## Heatmaps. Visualize the most variable genes in LUSC60
#install.packages("pheatmap")
library(pheatmap)

## identify the most variable genes
plot(density(apply(X,1,sd)))
ikeep = apply(X,1,sd)>3
table(ikeep)

## draw a heatmap
## scaling by row is important because genes are expressed at very different levels;
## without scaling the heatmap is hard to read visually
pheatmap(X[ikeep,], scale="row", fontsize_row=1, fontsize_col=5,
         main="Top variable genes")

## Exercise: try scale="none", select 1000+ variable genes

(In the LUSC60 heatmap, NT = normal tissue, TP = tumor primary; the color scale runs from lowly to highly expressed.)


2.4. Heatmaps: correlation for QC

## Heatmaps of correlation can be used for QC
library(pheatmap)

## calculate correlation between samples (columns)
R = cor(X, method="pearson")

## draw a heatmap of correlation
pheatmap(R, fontsize_row=5, fontsize_col=5,
         main="Correlation between samples")

Correlation between samples can be used to identify outliers or sample-swap mistakes in the experiment. In case you have outliers or work with non-log-transformed data, you could use method="spearman" instead of Pearson – a non-parametric correlation.

Note: the average correlation between samples is ~0.9. Why? Because some genes are always highly expressed and others always lowly expressed, any two expression profiles are strongly correlated. Note also that tumor (TP) samples are less similar to each other than normal tissues (NT).
Take Home Messages 2

Always check the distribution of your data! It can help you decide about pre-processing (log-transformation, normalization) and identify outliers.

Use PCA and correlation to identify outliers or strangely behaving samples. PCA can also show you the effects of experimental factors.

Use clustering to group your data (unsupervised approach):
• The k-means method is very robust, but you should know the number of clusters k in advance.
• Hierarchical clustering is quite flexible (k is variable) but not stable if you exclude, add or change a few samples.

A heatmap is a nice tool to visualize the expression of genes over the samples.
Use a heatmap of correlations to check similarities and groups in your samples.
Part 3

3. Statistical Basics
http://edu.modas.lu/transcript-seq/part3.html

see more here: http://edu.modas.lu/modas_dea/index.html


3. Statistical Basics

Questions:
Which genes have changes in mean expression level between conditions?
How reliable are these observations (what is your p-value or FDR)?

Differential Expression Analysis (DEA):
• Single factor, two conditions – similar to a t-test with Student's statistic: compare means.
• Multifactor or multicondition – similar to ANOVA with Fisher's statistic: compare variances, followed by post-hoc analysis. Example: two cell lines over time.

And do not forget about multiple hypothesis testing.

Key concepts in this part:
• hypotheses
• p-values
• FDR
• t-test
• ANOVA
3.1. Hypothesis Testing

When statisticians would like to make a claim, they do this in the form of hypothesis testing. In hypothesis testing, we begin by making a tentative assumption about a population parameter, i.e. by formulating a null hypothesis.

Null hypothesis
The hypothesis tentatively assumed to be true in the hypothesis-testing procedure, H0.
For safety reasons, we take as H0 the situation in which nothing "interesting" happens.

Alternative hypothesis
The hypothesis concluded to be true if the null hypothesis is rejected, Ha.
Ha is the situation in which we see something unusual, which requires action.

Hypotheses in the simplest case: comparing a mean to a constant
One-tailed:  H0: μ ≤ const, Ha: μ > const   or   H0: μ ≥ const, Ha: μ < const
Two-tailed:  H0: μ = const, Ha: μ ≠ const
3.1. Hypothesis Testing: Errors

False Negative – H0 is not rejected although it is false (β error, type II).
False Positive – H0 is rejected although it is true (α error, type I).
3.1. Hypothesis Testing: p-value

One-tailed test
A hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in one tail of its sampling distribution.

H0: μ ≥ μ0
Ha: μ < μ0

A Trade Commission (TC) periodically conducts statistical studies designed to test the claims that manufacturers make about their products. For example, the label on a large can of Hilltop Coffee states that the can contains 3 pounds of coffee. The TC knows that Hilltop's production process cannot place exactly 3 pounds of coffee in each can, even if the mean filling weight for the population of all cans filled is 3 pounds per can. However, as long as the population mean filling weight is at least 3 pounds per can, the rights of consumers will be protected. Thus, the TC interprets the label information on a large can of coffee as a claim by Hilltop that the population mean filling weight is at least 3 pounds per can. We will show how the TC can check Hilltop's claim by conducting a lower-tail hypothesis test.

Suppose a sample of n = 36 coffee cans is selected. μ0 = 3 lbm; from previous studies, it is known that σ = 0.18 lbm.
3.1. Hypothesis Testing: p-value

Suppose a sample of n = 36 coffee cans is selected and the sample mean m = 2.92 lbm is observed. From previous studies, it is known that σ = 0.18 lbm.

H0: μ ≥ 3  →  no action
Ha: μ < 3  →  legal action

Let's say: in the extreme case, when μ = 3, we would like to be 99% sure that we make no mistake when starting legal action against Hilltop Coffee. This means that the selected significance level is α = 0.01.
3.1. Hypothesis Testing: p-value

Let's find the probability of the observation m under all possible μ ≥ 3. We start from the extreme case (μ = 3) and then probe all possible μ > 3. Watch how the small probability area around the measured m behaves. What do you get if you take this area over all possible μ ≥ 3? This is the p-value: the probability of observing data at least as extreme as m if H0 is true, i.e. the risk of a mistake when rejecting H0.

## Calculate p-value: one sample mean

## parameters
n = 36
mu = 3

## calculation (if no data available)
m = 2.92       # m – mean from the experiment
sigma = 0.18   # st.dev. known beforehand
z = (m-mu)/sigma * sqrt(n)
pnorm(z)       # 0.003830381

## calculation (if data are available)
url = "http://edu.modas.lu/data/txt/coffee.txt"
x = scan(url)
t.test(x, mu = mu, alternative="less")

P(m) over all possible μ ≥ μ0 is equal to P(x < m) for the extreme case μ = μ0.
3.2. Hypothesis Testing: Multiple Testing
## Why do we need multiple testing correction?

## 1. Generate a random matrix: 1000 genes x 6 samples


X = matrix(rnorm(6*1000),nrow=1000,ncol=6)
rownames(X) = paste0("gene",1:1000)

## 2. Assume col 1,2,3 - exp, 4,5,6 - ctrl


colnames(X) = c("exp1","exp2","exp3","ctrl1","ctrl2","ctrl3")

## 3. Do a t.test for each "gene" (slow, but who cares :)


pv = NULL
for (i in 1:nrow(X))
pv[i] = t.test(X[i,1:3],X[i,4:6])$p.value

table(pv < 0.05) # around 50 false positives are expected

## do FDR adjustment
fdr = p.adjust(pv,"fdr")
table(fdr < 0.05)

3.2. Multiple Testing

The same error types apply to each individual test: False Negative (β error) and False Positive (α error).

Probability of at least one error in a multiple test, when α = 0.05: 1 − 0.95^(number of comparisons)
3.2. Multiple Testing: FDR

False discovery rate (FDR)
FDR control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. In a list of rejected hypotheses, FDR controls the expected proportion of incorrectly rejected null hypotheses (type I errors).

Population condition (m tests in total):

Conclusion                     H0 is TRUE   H0 is FALSE   Total
Accept H0 (non-significant)        U             T         m − R
Reject H0 (significant)            V             S           R
Total                              m0          m − m0        m

FDR = E[ V / (V + S) ]
3.2. Multiple Testing: FDR and FWER

False Discovery Rate: Benjamini & Hochberg
Assume we need to perform m = 100 comparisons and we select a maximum FDR = α = 0.05; recall FDR = E[ V / (V + S) ].
Sort the p-values in ascending order, P(1) ≤ P(2) ≤ … ≤ P(m). The expected FDR stays below α if we reject the hypotheses with
P(k) < (k/m)·α, i.e. with adjusted p-value m·P(k)/k < α.
(Theoretically the sign should be "≤", but for practical reasons it is replaced by "<".)
General usage: p.adjust(pv, method="fdr")

Familywise Error Rate (FWER) – more stringent; use it when you need high confidence in each selected gene.
• Bonferroni – simple, but too stringent and not recommended; adjusted p-value m·P(k) ≤ α.
• Holm–Bonferroni – a more powerful, less stringent, but still universal FWER control; reject while (m + 1 − k)·P(k) ≤ α.
  Usage: p.adjust(pv, method="holm")
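A small sketch (not from the slides) of the Benjamini–Hochberg adjustment done by hand on a made-up vector of p-values, compared with p.adjust():

pv  = c(0.001, 0.008, 0.039, 0.041, 0.20, 0.74)   # made-up p-values
m   = length(pv)
o   = order(pv)                                   # ascending order of p-values
adj = pmin(1, rev(cummin(rev(sort(pv) * m / seq_len(m)))))  # m*P(k)/k, made monotone
bh  = adj[order(o)]                               # back to the original order
cbind(manual = bh, p.adjust = p.adjust(pv, method="fdr"))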
3.3. Linear Models

Many conditions
We have measurements for 5 conditions. Are the means of these conditions equal?
If we used pairwise comparisons, what would be the probability of getting an error?
Number of comparisons: C(5,2) = 5! / (2!·3!) = 10
Probability of at least one error: 1 − 0.95^10 ≈ 0.40

Many factors
We assume that we have several factors affecting our data. Which factors are most significant? Which can be neglected?

→ ANOVA (example from Partek™)
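A quick check of these numbers in R (assuming independent tests at α = 0.05):

choose(5, 2)    # 10 pairwise comparisons among 5 conditions
1 - 0.95^10     # ~0.40: probability of at least one false positive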
3.3. Linear Models: ANOVA

As part of a long-term study of individuals 65 years of age or older, sociologists and physicians at the Wentworth Medical Center in upstate New York investigated the relationship between geographic location and depression. A sample of 60 individuals, all in reasonably good health, was selected; 20 individuals were residents of Florida, 20 were residents of New York, and 20 were residents of North Carolina. Each of the individuals sampled was given a standardized test to measure depression. The data collected follow; higher test scores indicate higher levels of depression.

Q: Is the depression level the same in all 3 locations?

H0: μ1 = μ2 = μ3
Ha: not all 3 means are equal

depression.txt – 1. Good-health respondents
Florida   New York   N. Carolina
   3          8          10
   7         11           7
   7          9           3
   3          7           5
   8          8          11
   8          7           8
   …          …           …
3.3. Linear Models: ANOVA

H0: μ1 = μ2 = μ3
Ha: not all 3 means are equal

[Figure: depression level per respondent, grouped by location (FL, NY, NC), with the group means m1, m2, m3.]

Please see the code and explanation online:
http://edu.modas.lu/transcript-seq/part3.html

## load data (*)
Dep = read.table("depression2.txt",
                 header=T, sep="\t", as.is=FALSE)
str(Dep)

## run 1-factor ANOVA (good-health respondents only)
DepGH = Dep[Dep$Health == "good",]
res1 = aov(Depression ~ Location, DepGH)
summary(res1)
TukeyHSD(res1)

## run 2-factor ANOVA with interaction
res2 = aov(Depression ~ Location + Health + Location*Health, Dep)
summary(res2)
TukeyHSD(res2)

## (*) http://edu.modas.lu/data/txt/depression2.txt
Take Home Messages 3

When doing multiple hypothesis testing and selecting only the significant elements – always use an FDR (or other, e.g. FWER) correction!

The simplest correction: multiply the p-value by the number of genes. Is it still significant? In practice, use FDR (Benjamini–Hochberg) or FWER (Holm).

DEA detects the genes whose mean expression has changed between conditions.
=> The more data you have, the smaller the differences you will be able to detect.

Several factors can be taken into account in an ANOVA approach. This gives you insight into the significance of each experimental factor, and at the same time it corrects batch effects and allows you to answer complex questions (remember the shoes affecting the ladies…).
Part 4

4. Statistics for RNA-seq


http://edu.modas.lu/transcript-seq/part4.html

see more here: http://edu.modas.lu/modas_dea/index.html


4.1. Linear Models for Transcriptomics Data

Yij = µi + Aj + Bj + Aj∗Bj + ϵij
i – gene index; j – sample index
µi – average expression of gene i
Aj, Bj – effects of experimental factors A and B on sample j
Aj∗Bj – interaction effect, which cannot be explained by the superposition of A and B
ϵij – residual error

limma – R package for DEA in microarrays or RNA-seq based on linear models. It is similar to a t-test / ANOVA but uses all available data for variance estimation, thus it has higher power when the number of replicates is limited. It assumes a normal distribution of values for the gene between replicates. Apply it to normalized, log-transformed counts. It is fast, although somewhat less sensitive for count data than the count-based methods below.

edgeR – R package for DEA in RNA-seq, based on linear models and a negative binomial distribution of counts. The better noise model results in higher power for detecting differentially expressed genes; it assumes a negative binomial distribution of values for the gene between replicates. Apply it to raw counts!

DESeq2 – another R package for DEA in RNA-seq, based on the negative binomial distribution of counts; it is generally the most sensitive of the three. The better noise model results in higher power for detecting differentially expressed genes; it assumes a negative binomial distribution of values for the gene between replicates. Apply it to raw counts!
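For reference, a minimal sketch of a standard edgeR quasi-likelihood workflow (this is not the course wrap-up used on the next slides; it assumes a raw integer count matrix counts and a factor group of sample conditions):

library(edgeR)
y      = DGEList(counts=counts, group=group)         # raw integer counts, genes x samples
y      = y[filterByExpr(y), , keep.lib.sizes=FALSE]  # drop weakly expressed genes
y      = calcNormFactors(y)                          # TMM normalization factors
design = model.matrix(~ group)
y      = estimateDisp(y, design)                     # gene-wise NB dispersions
fit    = glmQLFit(y, design)                         # quasi-likelihood GLM fit
res    = glmQLFTest(fit, coef=2)                     # test the group effect
topTags(res)                                         # top differentially expressed genes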
4.2. DEA: Preparing for limma, edgeR, DESeq2

## install packages
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("limma")
BiocManager::install("edgeR")
BiocManager::install("DESeq2")

## if you wish, you can use my simple wrap-up
source("http://r.modas.lu/LibDEA.r")
## it provides: DEA.limma, DEA.edgeR, DEA.DESeq

Here we should define contrasts: "condition1 – condition2",
where condition 1 is the experimental group and condition 2 is the control group.
4.2. DEA: Time Series Experiment

Experiment: A375 cells stimulated by IFNg. The data are stored in annotated text format (annotation – metadata – data).

## Let's use limma for a time-series experiment
## load the data that are in annotated text format
source("http://r.modas.lu/readAMD.r")
mRNA = readAMD("http://edu.modas.lu/data/txt/mrna_ifng.amd.txt",
               stringsAsFactors=TRUE,
               index.column="GeneSymbol",
               sum.func="mean")
str(mRNA)

## attach library with wrap-up functions
source("http://r.modas.lu/LibDEA.r")

## DEA: the most variable genes (by F-statistics)
ResF = DEA.limma(data = mRNA$X, group = mRNA$meta$time)
genes = order(ResF$FDR)[1:100] ## select top 100 genes
pheatmap(mRNA$X[genes,], cluster_col=FALSE, scale="row",
         fontsize_row=2, fontsize_col=10, cellwidth=15,
         main="Top 100 significant genes (F-stat)")

## DEA: genes differentially expressed (by moderated t-test)
Res24 = DEA.limma(data = mRNA$X,
                  group = mRNA$meta$time,
                  key0="T00", key1="T24")

## volcano plot
plotVolcano(Res24, thr.fdr=0.01, thr.lfc=1)
genes = order(Res24$FDR)[1:100] ## select top 100 genes
samples = grep("T00|T24", mRNA$meta$time) ## select T00, T24 samples
pheatmap(mRNA$X[genes,samples], cluster_col=FALSE, scale="row",
         fontsize_row=2, fontsize_col=10, cellwidth=15,
         main="Top 100 significant genes T24-T00 (moderated t-stat)")

See more at http://edu.modas.lu/modas_dea/part3.html
4.2. DEA: Time Series Experiment

## Save results

## save the most variable genes (by F-statistics)
write.table(ResF[ResF$FDR<0.0001,], file = "DEA_F.txt",
            col.names=NA, sep="\t", quote=FALSE)

## save significant genes T24-vs-T00
write.table(Res24[Res24$FDR<0.001 & abs(Res24$logFC)>1,],
            file = "DEA_T24-T00.txt",
            col.names=NA, sep="\t", quote=FALSE)

## save gene list (response at 24 h of IFNg treatment)
write(Res24[Res24$FDR<0.0001,1], file="genes24.txt")

Please investigate the results. Submit any gene list to the functional annotation tool Enrichr
(https://maayanlab.cloud/Enrichr/): it tests the list against thousands of gene sets and suggests the likely functions.
4.3. Functional Annotation: Enrichment

Are interesting genes over-represented in a subset corresponding to some biological process?

Intuition: someone grabs "randomly" 20 balls from a box with 50 red and 50 green balls. How surprised will you be if he grabbed 17 red and only 3 green? Such a strong excess corresponds to a highly enriched category (A in the slide figure), a milder excess to an enriched category (B), and a balanced draw to no enrichment (C).

Method of the analysis: Fisher's exact test
4.3. Gene Over-representation Analysis

Fisher's exact test is based on the hypergeometric distribution: the distribution of objects taken from a "box" without putting them back.

C(n,k) = n! / (k!·(n−k)!)
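A minimal sketch (not from the slides) of the ball example above in R: 50 red + 50 green balls, 20 drawn, 17 red observed. The hypergeometric tail and the one-sided Fisher's exact test give the same over-representation p-value.

phyper(17 - 1, m=50, n=50, k=20, lower.tail=FALSE)   # P(X >= 17 red among 20 drawn)
tab = matrix(c(17, 3, 33, 47), nrow=2,               # rows: red/green; cols: drawn/not drawn
             dimnames=list(c("red","green"), c("drawn","not.drawn")))
fisher.test(tab, alternative="greater")$p.value      # same one-sided p-value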
4.3. Gene Set Enrichment Analysis

GSEA asks: is the direction of change of all genes in a category random? It does not require a pre-selected list of significant genes, so it is a good last resort when no individual genes pass the significance threshold.
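A minimal sketch of a pre-ranked GSEA in R, using the Bioconductor package fgsea (not introduced on the slides); stat is an assumed named vector of per-gene statistics (e.g. limma t-values) and sets an assumed list of gene sets:

# BiocManager::install("fgsea")
library(fgsea)
ranks = sort(stat, decreasing=TRUE)            # named, ranked gene-level statistic (assumed input)
res   = fgsea(pathways=sets, stats=ranks)      # enrichment scores and permutation-based p-values
head(res[order(res$padj), ])                   # most significant gene sets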
Take Home Messages 4

If you are looking at a multi-factor / multi-treatment experiment, you may check the variable genes (F-statistics based) first, and then go for the contrasts.

To find the biological meaning of the significantly regulated genes, use enrichment analysis methods linking known functional groups of genes to the DEA results.

Enriched categories are usually more robust than individual genes. If you have no significant genes – check gene sets with GSEA.

Useful tools:
Enrichr: https://maayanlab.cloud/Enrichr/
DAVID: https://david.ncifcrf.gov/
Reactome: https://reactome.org/
STRING: https://string-db.org/
WikiPathways: https://wikipathways.org/
Summary

Raw Data
  ↓  QA/QC + remove outliers
Normalization (remove technical artefacts, make data comparable)
  ↓
Filtering (remove uninformative features)
  ↓
Processed Data → downstream analyses:
• Visualization and exploratory analysis (PCA, clustering)
• DEA (differential expression analysis)
• Enrichment analysis (GO, functions, TFs, drugs) and GSEA (gene set enrichment analysis)
• Prediction (signatures for classification)
• Network reconstruction (not considered in this course)
