Nazarov QC-Statistics
1. Data Overview
http://edu.modas.lu/transcript-seq/part1.html
1.1. RNA-seq Data Generation
read: short fragment detected by RNA-seq
library: collection of all reads from the sample
CPM: counts per million mapped reads
TPM: transcripts per million (proportion)
FPKM: fragments per kilobase of exon per million reads mapped
RPKM: reads per ……. (for single-end)
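The units listed above can be illustrated with a minimal sketch. The count matrix and the gene lengths are invented toy values, not course data, and TPM is computed with the commonly used definition (length-normalized rates rescaled to one million per sample).

```r
## toy raw counts: 3 genes (rows) x 2 samples (columns); values are invented
counts = matrix(c(100, 200, 700,
                   50, 400, 550), nrow = 3,
                dimnames = list(c("g1", "g2", "g3"), c("S1", "S2")))
## hypothetical gene (exon) lengths in base pairs
len = c(g1 = 1000, g2 = 2000, g3 = 5000)

## CPM: counts per million reads in the library
cpm = t(t(counts) / colSums(counts)) * 1e6
## RPKM/FPKM: CPM additionally divided by the gene length in kilobases
rpkm = cpm / (len / 1000)
## TPM: length-normalized rates, rescaled so that each sample sums to 1e6
rate = counts / len
tpm = t(t(rate) / colSums(rate)) * 1e6

round(cpm)
round(tpm)
colSums(tpm)   # each sample sums to one million
```

Unlike RPKM/FPKM, the TPM values of every sample add up to the same total, which makes proportions comparable between samples.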
https://learn.gencore.bio.nyu.edu/ngs-file-formats/quality-scores/
1.3. Sequence-based QC: FastQC
FastQC – a simple but widely-used Java-based tool for quality control of the experiments at the sequence level. It provides
a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which
you should be aware before doing any further analysis.
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Examples
Introduction: https://www.youtube.com/watch?v=BbScv9TcaMg
1.4. Statistical Properties of the Data
ID Gene.Symbol A1 A2 A3 A4 B1 B2
ENSG00000135899 SP110 32 31 33 33 136 136
ENSG00000154451 GBP5 0 0 0 0 395 383
ENSG00000226025 LGALS17A 0 0 0 0 217 196
ENSG00000213512 GBP7 0 0 0 0 44 47
ENSG00000260873 SNTB2 198 193 195 196 483 502
ENSG00000063046 EIF4B 552 546 548 550 428 429
ENSG00000102524 TNFSF13B 0 0 0 0 16 17
ENSG00000107201 DDX58 79 81 82 77 296 310
ENSG00000010030 ETV7 2 2 2 0 93 85
ENSG00000125347 IRF1 22 24 27 22 234 236
ENSG00000180616 SSTR2 0 0 0 0 19 21
ENSG00000155962 CLIC2 2 2 1 1 71 65
ENSG00000153944 MSI2 55 54 54 54 37 37
ENSG00000197646 PDCD1LG2 0 0 0 0 58 60
ENSG00000108771 DHX58 5 4 4 5 26 25
ENSG00000100336 APOL4 9 8 11 8 130 135
ENSG00000182551 ADI1 88 86 88 89 59 60
ENSG00000128284 APOL3 14 14 14 13 85 94
ENSG00000153989 NUS1 214 216 212 214 167 167
ENSG00000131979 GCH1 57 61 57 56 172 167
Note: values for the same gene in the same condition are replicates (e.g. NUS1 in A1-A4: 214, 216, 212, 214).
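As a small illustration of the statistical properties of such a table, the sketch below takes three rows from it, applies a log2 transformation and compares the two conditions. The pseudocount of 1 and the grouping of columns into A (A1-A4) and B (B1, B2) are assumptions made for this example.

```r
## a few rows copied from the table above (columns A1..A4, B1, B2)
counts = rbind(
  SP110 = c(32, 31, 33, 33, 136, 136),
  GBP5  = c(0, 0, 0, 0, 395, 383),
  EIF4B = c(552, 546, 548, 550, 428, 429))
colnames(counts) = c("A1", "A2", "A3", "A4", "B1", "B2")
group = c("A", "A", "A", "A", "B", "B")

## log2 transformation with a pseudocount of 1 to handle zeros
X = log2(1 + counts)
## mean log-expression per condition and log2 fold change B vs A
mA = rowMeans(X[, group == "A"])
mB = rowMeans(X[, group == "B"])
logFC = mB - mA
round(cbind(mA, mB, logFC), 2)
```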
ArrayExpress: http://www.ebi.ac.uk/arrayexpress/
Take Home Messages 1
Note: use this for the exam in January; check the Word document with an overview of everything on
MODAS.lu
nf-core provides pipelines for the analysis of RNA-seq data, metagenomes, ChIP-seq and nanopore data
RNA-seq data can be used as raw counts or normalized (TPM, FPKM). See what you need for a
specific algorithm!
For QC of your samples at the sequence level – use FastQC. To combine results - MultiQC
Several large repositories of the data exist. Before planning your experiments – make a
search for existing data
Part 2
2. Exploratory Data Analysis
http://edu.modas.lu/transcript-seq/part2.html
## density plot
plot(density(X), col="blue", lwd=2)
## boxplot
boxplot(X, col="lightblue", las=2)
"normalized" data?
## try this:
plot(density(X), col="black", lwd=2)
for (i in 1:ncol(X))
lines(density(X[,i]),col="#0000FF33")
2.2. Dimensionality Reduction
Each sample (object) is represented by 20 000 genes (features)… How can we visualize samples in an understandable way?
Use dimensionality reduction! Look at the data from a different "angle" that shows most of the variability.
PCA - rotation of the coordinate system in multidimensional space in
the way to capture main variability in the data. (Note: check the website that
shows how it follows the most variable data.)
ICA - matrix factorization method that identifies statistically
independent signals and their weights. (Note: basic mathematical operations.)
t-SNE - an iterative approach, similar to MDS, but considering only
close objects. Thus, similar objects must be close in the new
(reduced) space, while distant objects are not influencing the results.
UMAP - modern method, similar to t-SNE, but more stable and with
preservation of some information about distant groups (preserving
topology of the data).
Please check some nice interactive resources online:
• Principal Component Analysis Explained Visually by Victor Powell
• Understanding UMAP by Andy Coenen and Adam Pearce
• Dimensionality Reduction for Data Visualization: PCA vs TSNE vs UMAP vs LDA by Sivakar Sivarajah
2.2. Dimensionality Reduction: PCA
Principal component analysis (PCA)
is a vector space transform used to reduce multidimensional data sets to lower dimensions
for analysis. It selects the coordinates along which the variation of the data is bigger.
Note: 20000 genes are reduced to 2 dimensions.
For simplicity, let us consider a 2-parameter situation, both in terms of the data and the resulting PCA.
[Figure: scatter plot in "natural" coordinates (axis label: Variable 2) and scatter plot in PC coordinates (axis label: Second component)]
Instead of using 2 "natural" parameters for the classification, we can use the first component!
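The 2-parameter situation can be reproduced with a minimal sketch on simulated data. The two variables below are invented for illustration; prcomp() is the same base R function used later in this part.

```r
set.seed(1)
## two correlated "natural" variables (simulated)
v1 = rnorm(200)
v2 = 0.8 * v1 + rnorm(200, sd = 0.3)
D = cbind(Variable1 = v1, Variable2 = v2)

PC = prcomp(D)            # centers the data, then rotates the axes
## proportion of variance captured by each component
pve = PC$sdev^2 / sum(PC$sdev^2)
round(pve, 3)

## scatter plots in natural and in PC coordinates
par(mfrow = c(1, 2))
plot(D, pch = 19, main = "Natural coordinates", asp = 1)
plot(PC$x, pch = 19, main = "PC coordinates", asp = 1)
```

Because the two variables are strongly correlated, the first component carries most of the variance, so it alone could be used in place of both original variables.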
2.2. Dimensionality Reduction: PCA example 1
2.2. Dimensionality Reduction: PCA example 2
Note: PCA tries to capture the differences in the data more than the similarities
## download and load the data for TCGA
## lung squamous cell carcinoma patients
##----------------------------------------------------
## X: expression matrix, genes in rows and samples in columns
## exclude genes with 0 variance
X = X[apply(X,1,var)>0,]
## Run PCA on the transposed X (samples in rows)
PC = prcomp(t(X),scale=TRUE)
str(PC)
## Visualize (NT: non-tumor, blue; TP: tumor primary, red)
plot(PC$x[,1],PC$x[,2], pch=19,
col=c(NT="blue",TP="red")[as.character(LUSC$meta$sample_type)])
##-------------------------------------------------------
## Task: plot PCA for the complete dataset
url = "http://edu.modas.lu/data/rda/LUSC.RData"
More at http://edu.modas.lu/modas_eda/part3.html
2.3. Clustering
Note: clustering is unsupervised machine learning
2.3. Clustering: k-means
Note: exact (hard) clustering, clusters do not overlap at all; robust but rigid; k has to be defined in advance
k-Means Clustering
k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which
each observation belongs to the cluster with the nearest mean.
## k-means clustering
clusters = kmeans(x=t(X),centers=2,nstart=10)$cluster
## validate clusters
table(clusters,LUSC$meta$sample_type)
## get PCA results (use old if you have)
PC = prcomp(t(X),scale=TRUE)
## visualize as PCA, use colors to represent clusters
plot(PC$x[,1],PC$x[,2],col = clusters, pch=19, cex=2,
main="PCA, colored by k-means clusters")
## k-means clustering
clusters = kmeans(iris[,-5],centers=3,nstart=10)$cluster
## validate clusters
table(clusters,iris[,5])
## get PCA results (use old if you have)
PC = prcomp(iris[,-5])
## visualize as PCA, use colors to represent clusters
plot(PC$x[,1],PC$x[,2],col = clusters, pch=19, cex=2,
main="PCA, colored by k-means clusters")
Note: clustering is not always optimal; here two clusters (pink and green) look like they overlap.
We have to know the number of clusters to interpret the result; without context it
could be misunderstood.
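Since k has to be chosen in advance, one common heuristic (not part of the slides, added here as an assumption) is to compare the total within-cluster sum of squares for several values of k and look for the "elbow" where the improvement levels off. A minimal sketch on the iris data:

```r
## total within-cluster sum of squares for k = 1..6 (iris data)
set.seed(1)
wss = sapply(1:6, function(k)
  kmeans(iris[, -5], centers = k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = "b", pch = 19,
     xlab = "number of clusters k", ylab = "total within-cluster SS")
```

The curve always decreases with k, so the choice is a judgment call rather than a strict rule.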
2.3. Clustering: hierarchical
unstable when data is changed but easy to visualize
Hierarchical Clustering
Hierarchical clustering creates a hierarchy of clusters that may be represented in a tree structure called a dendrogram.
The root of the tree consists of a single cluster containing all observations, and the leaves correspond to individual
observations.
Algorithms for hierarchical clustering are generally either agglomerative, in which one starts at the leaves and successively
merges clusters together; or divisive, in which one starts at the root and recursively splits the clusters.
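[Figure: dendrogram of elements; agglomerative clustering proceeds from the leaves to the root, divisive clustering from the root to the leaves. Source: http://wikipedia.org]
Note: we can choose where to cut the "tree".
Note: agglomerative clustering starts with individual objects and groups them together.
Distance: Euclidean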
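The agglomerative approach with Euclidean distance can be sketched in base R as follows. The iris data are used as a stand-in for the course data, and the Ward linkage is an arbitrary choice made for this example.

```r
## Euclidean distances between observations (iris used as a stand-in)
d = dist(iris[, -5], method = "euclidean")
## agglomerative clustering; the Ward linkage is an arbitrary choice here
hc = hclust(d, method = "ward.D2")
plot(hc, labels = FALSE, main = "Hierarchical clustering (iris)")
## cut the tree into 3 clusters and compare with the known species
clusters = cutree(hc, k = 3)
table(clusters, iris[, 5])
```

Changing k in cutree() corresponds to cutting the dendrogram at a different height.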
2.3. Clustering: k-means
## here we will use previously calculated X from LUSC60
2.4. Heatmaps
Note: heatmaps are used to present 3D data (sample, gene)
## Heatmaps. Visualize the most variable genes in LUSC60
#install.packages("pheatmap" )
library(pheatmap)
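The rest of the slide code is not included here, so the following is only a sketch of the idea on simulated data (not LUSC60): select the most variable genes and draw a row-scaled heatmap, then a heatmap of sample correlations. The matrix, the number of genes and the group shift are all invented.

```r
## simulated log-expression: 200 genes x 12 samples (stand-in for LUSC60)
set.seed(1)
X = matrix(rnorm(200 * 12), nrow = 200,
           dimnames = list(paste0("gene", 1:200), paste0("S", 1:12)))
X[1:20, 7:12] = X[1:20, 7:12] + 3   # 20 genes differ between two groups

## select the 30 most variable genes
v = apply(X, 1, var)
top = names(sort(v, decreasing = TRUE))[1:30]

## row-scaled heatmap; fall back to base heatmap() if pheatmap is missing
if (requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(X[top, ], scale = "row")
} else {
  heatmap(X[top, ], scale = "row")
}
## heatmap of sample-to-sample correlations
heatmap(cor(X[top, ]), scale = "none", symm = TRUE)
```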
correlation 0 when 2 patients samples are compared and their difference is heatmapped
Take Home Messages 2
Always check the distribution of your data! It can help you decide about pre-processing
(log-transformation, normalization) and identify outliers.
Use PCA and correlation to identify outliers or strangely behaving samples. PCA can also
show you the effects of experimental factors
Heatmap is a nice tool to visualize the expression of genes over the samples.
Use a heatmap of correlations to check similarities and groups in your samples.
Part 3
3. Statistical Basics
http://edu.modas.lu/transcript-seq/part3.html
Null hypothesis
The hypothesis tentatively assumed true in the hypothesis testing procedure, H0 .
For safety reasons, we assume a situation when nothing “interesting” happens as H0
Alternative hypothesis
The hypothesis concluded to be true if the null hypothesis is rejected, Ha
Ha will be a situation when we see something unusual, which requires action
False Negative: Type II error (β)
False Positive: Type I error (α)
3.1. Hypothesis Testing: p-value
One-tailed test
A hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in one tail of its
sampling distribution
H0: μ ≥ μ0
Ha: μ < μ0
A Trade Commission (TC) periodically conducts statistical studies designed to test the claims that
manufacturers make about their products. For example, the label on a large can of Hilltop Coffee
states that the can contains 3 pounds of coffee. The TC knows that Hilltop's production process
cannot place exactly 3 pounds of coffee in each can, even if the mean filling weight for the population
of all cans filled is 3 pounds per can. However, as long as the population mean filling weight is at least
3 pounds per can, the rights of consumers will be protected. Thus, the TC interprets the label
information on a large can of coffee as a claim by Hilltop that the population mean filling weight is at
least 3 pounds per can. We will show how the TC can check Hilltop's claim by conducting a lower tail
hypothesis test.
Suppose a sample of n = 36 coffee cans is selected. From the previous studies, it’s
known that σ = 0.18 lbm
μ0 = 3 lbm
3.1. Hypothesis Testing: p-value
μ0 = 3 lbm. Suppose a sample of n = 36 coffee cans is selected and m = 2.92 is
observed. From the previous studies, it’s known that σ = 0.18 lbm
H0: μ ≥ 3 (no action)
Let’s say: in the extreme case, when μ = 3, we would like to be 99% sure that we make
no mistake when starting legal actions against Hilltop Coffee. It means that the selected
significance level is α = 0.01
should be tested OK
3.1. Hypothesis Testing: p-value
Let’s find the probability of observation m for all possible μ ≥ 3. We start from an extreme case
(μ = 3) and then probe all possible μ > 3. See the behavior of the small probability area around
measured m. What will you get if you sum up its area for all possible μ ≥ 3?
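## Calculate p-value: one sample mean
Note: the p-value is the probability of observing a result at least as extreme as m if H0 is true, i.e. the risk of making a mistake when rejecting H0
## parameters
n = 36
mu = 3
P(m) for all possible μ ≥ μ0 is equal to P(x < m) for the extreme case μ = μ0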
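The slide code breaks off after the parameters. A minimal way to finish the calculation, assuming a z-test with known σ (the variable names s, m, z and p.value are chosen here, not taken from the slides), could look like this:

```r
## parameters from the Hilltop Coffee example
n  = 36      # sample size
mu = 3       # hypothesized mean under H0 (extreme case mu = mu0)
s  = 0.18    # known population standard deviation
m  = 2.92    # observed sample mean

## z-statistic: distance of m from mu in units of the standard error
z = (m - mu) / (s / sqrt(n))
## lower-tail p-value: P(x < m) when mu = mu0
p.value = pnorm(z)
z
p.value
p.value < 0.01   # compare with the chosen significance level
```

The standard error is 0.18 / 6 = 0.03, so m lies about 2.67 standard errors below μ0, and the p-value is below the chosen α = 0.01.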
3.2. Hypothesis Testing: Multiple Testing
## Why do we need multiple testing correction?
## do FDR adjustment
fdr = p.adjust(pv,"fdr")
table(fdr < 0.05)
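The vector pv is not defined in the lines above. To see why correction is needed, one can simulate data in which no gene truly differs between groups; the sketch below does this with invented random data and a per-gene t-test, which is an assumption and not necessarily the example used in the course.

```r
set.seed(1)
## 1000 "genes", 6 samples, pure noise: no gene truly differs between groups
X = matrix(rnorm(1000 * 6), nrow = 1000)
group = rep(c("A", "B"), each = 3)
## p-value of a t-test for each gene
pv = apply(X, 1, function(x) t.test(x[group == "A"], x[group == "B"])$p.value)
## without correction, some genes look "significant" purely by chance
table(pv < 0.05)
## do FDR adjustment
fdr = p.adjust(pv, "fdr")
table(fdr < 0.05)
```

Every gene with pv < 0.05 in this simulation is a false positive, which is what the adjustment is meant to control.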
3.2. Multiple Testing
Conclusion vs. population condition:
                              H0 is TRUE   H0 is FALSE   Total
Accept H0 (non-significant)   U            T             m – R
Reject H0 (significant)       V            S             R
Total                         m0           m – m0        m
V: False Positive, Type I error (α)
T: False Negative, Type II error (β)
$\mathrm{FDR} = E\left[\frac{V}{V+S}\right]$
3.2. Multiple Testing: FDR and FWER
False Discovery Rate: Benjamini & Hochberg
$\mathrm{FDR} = E\left[\frac{V}{V+S}\right]$
Assume we need to perform m = 100 comparisons, and select maximum FDR = α = 0.05
Expected value for FDR < α if $P_{(k)} \le \frac{k}{m}\alpha$, i.e. $\frac{m P_{(k)}}{k} \le \alpha$, where $P_{(k)}$ is the k-th smallest p-value
Theoretically, the sign should be “≤”.
But for practical reasons it is replaced by “<“
General usage: p.adjust(pv, method="fdr")
Bonferroni – simple, but too stringent, not recommended: $m P_{(k)} \le \alpha$, where $m P_{(k)}$ is the adjusted p-value
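The Benjamini-Hochberg adjustment can be reproduced by hand and compared with p.adjust(). The p-values below are invented for illustration; the running minimum is the step that keeps adjusted values monotone, as in the standard procedure.

```r
## example p-values (invented for illustration)
pv = c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9)
m = length(pv)
o = order(pv)                    # position k = 1..m of each sorted p-value
## m * P(k) / k for the sorted p-values
adj = m * pv[o] / seq_len(m)
## enforce monotonicity (running minimum from the largest p-value down), cap at 1
adj = pmin(1, rev(cummin(rev(adj))))
fdr.manual = numeric(m)
fdr.manual[o] = adj              # back to the original order
## compare with the built-in implementation and with Bonferroni
all.equal(fdr.manual, p.adjust(pv, method = "fdr"))
p.adjust(pv, method = "bonferroni")   # equals pmin(1, m * pv)
```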
Many factors
We assume that we have several factors
affecting our data. Which factors are
most significant? Which can be
neglected?
ANOVA
example from Partek™
3.3. Linear Models: ANOVA
As part of a long-term study of individuals 65 years of age or older, sociologists and physicians at the Wentworth Medical Center
in upstate New York investigated the relationship between geographic location and depression. A sample of 60 individuals, all in
reasonably good health, was selected; 20 individuals were residents of Florida, 20 were residents of New York, and 20 were
residents of North Carolina. Each of the individuals sampled was given a standardized test to measure depression. The data
collected follow; higher test scores indicate higher levels of depression.
Q: Is the depression level same in all 3 locations?
3.3. Linear Models: ANOVA
## load data (*)
Dep = read.table("depression2.txt",
header=TRUE, sep="\t", as.is=FALSE)
str(Dep)
(*) http://edu.modas.lu/data/txt/depression2.txt
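The column names of depression2.txt are not shown here, so the sketch below runs a one-way ANOVA on simulated data with the same design (3 locations, 20 individuals each). The data frame Dep.sim, its column names and the group means are hypothetical.

```r
set.seed(1)
## simulated stand-in for the depression data: 20 individuals per location
Dep.sim = data.frame(
  Location = factor(rep(c("Florida", "New York", "North Carolina"), each = 20)),
  Depression = rnorm(60, mean = rep(c(5.5, 8.0, 7.0), each = 20), sd = 2))

## one-way ANOVA: does the mean score differ between locations?
fit = aov(Depression ~ Location, data = Dep.sim)
summary(fit)
## p-value of the Location factor
p.location = summary(fit)[[1]][["Pr(>F)"]][1]
## pairwise comparisons between locations
TukeyHSD(fit)
```

With the real file, the same calls would apply once the formula uses the actual column names reported by str(Dep).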
Take Home Messages 3
When doing multiple hypothesis testing and selecting only those elements
which are significant, always use FDR (or other, like FWER) correction! (Note for exams.)
The simplest correction: multiply the p-value by the number of genes. Is it still
significant? Use FDR (Benjamini-Hochberg) or FWER (Holm)
DEA detects the genes which have changed mean gene expression
between conditions
=> The more data you have, the smaller differences you will be able to see
Several factors can be taken into account in ANOVA approach. This will give
you insight into the significance of each experimental factor but at the same
time will correct batch effects and allow you to answer complex questions
(remember shoes affecting ladies…).
Part 4
Limma – R package for DEA in microarrays or RNA-seq based on linear models. (Note: less sensitive but fast.)
It is similar to t-test / ANOVA but uses all available data for variance estimation, thus it has
higher power when the number of replicates is limited. It assumes a normal distribution of
values for the gene between replicates. Apply it to normalized, log-transformed counts.
edgeR – R package for DEA in RNA-Seq, based on linear models and negative binomial
distribution of counts. Apply to raw counts!
A better noise model generally results in higher power in detecting differentially expressed genes. It assumes
a negative-binomial distribution of values for the gene between replicates.
DESeq2 – another R package for DEA in RNA-Seq, based on the negative binomial distribution
of counts. DESeq2 is the most sensitive of the three. Apply to raw counts!
A better noise model results in higher power in detecting differentially expressed genes. It assumes
a negative-binomial distribution of values for the gene between replicates.
4.2. DEA: Preparing for limma, edgeR, DESeq2
## install packages
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("limma")
BiocManager::install("edgeR")
BiocManager::install("DESeq2")
## if you wish, you can use my simple wrap-up
source("http://r.modas.lu/LibDEA.r")
DEA.limma
DEA.edgeR
DEA.DESeq
Here we should define contrasts: "condition1 – condition2"
condition 1 – experimental group
condition 2 – control group
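A contrast of the form "condition1 – condition2" can be set up directly in limma as in the sketch below. It does not use the DEA.limma wrapper, whose arguments are not shown here; the expression matrix and the group labels "treated" and "control" are simulated and hypothetical, and the code only runs if limma is installed.

```r
if (requireNamespace("limma", quietly = TRUE)) {
  library(limma)
  set.seed(1)
  ## simulated log2 expression: 100 genes x 6 samples (3 control, 3 treated)
  X = matrix(rnorm(600, mean = 8), nrow = 100,
             dimnames = list(paste0("gene", 1:100), paste0("S", 1:6)))
  X[1:5, 4:6] = X[1:5, 4:6] + 3          # 5 genes up-regulated in "treated"
  group = factor(rep(c("control", "treated"), each = 3))

  design = model.matrix(~ 0 + group)      # one column per condition
  colnames(design) = levels(group)
  fit = lmFit(X, design)
  ## contrast: experimental group minus control group
  cont = makeContrasts(treated - control, levels = design)
  fit2 = eBayes(contrasts.fit(fit, cont))
  res = topTable(fit2, number = Inf, adjust.method = "BH")
  head(res)
}
```

With this order of the contrast, a positive logFC means higher expression in the experimental group than in the control group.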
4.2. DEA: Time Series Experiment
## Let's use limma for a time-series experiment
## Experiment: A375 cells stimulated by IFNg
## load the data that are in annotated text format
source("http://r.modas.lu/readAMD.r")
mRNA = readAMD("http://edu.modas.lu/data/txt/mrna_ifng.amd.txt",
stringsAsFactors=TRUE,
index.column="GeneSymbol",
sum.func="mean")
str(mRNA)
## Save results
4.3. Gene Over-representation Analysis
Hypergeometric distribution: distribution of objects taken from a “box”, without putting them back
Fisher’s exact test: based on hypergeometric distributions
$C_n^k = \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$
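Over-representation of a gene category among significant genes can be tested with the hypergeometric distribution or, equivalently, with a one-sided Fisher's exact test. The numbers below are hypothetical and chosen only for illustration.

```r
## hypothetical numbers, chosen only for illustration
N = 20000   # all genes considered (the "box")
K = 200     # genes annotated to the category
n = 500     # significant (DEA) genes drawn from the box
k = 15      # significant genes that belong to the category

## expected overlap by chance
n * K / N
## P(overlap >= k) from the hypergeometric distribution
p.hyper = phyper(k - 1, K, N - K, n, lower.tail = FALSE)
## the same test as a one-sided Fisher's exact test on the 2x2 table
tab = matrix(c(k, K - k, n - k, N - K - n + k), nrow = 2,
             dimnames = list(c("in category", "not in category"),
                             c("significant", "not significant")))
p.fisher = fisher.test(tab, alternative = "greater")$p.value
c(p.hyper, p.fisher)
```

Here 5 genes would be expected in the overlap by chance, so observing 15 gives a small p-value; in practice the p-values of all tested categories should then be adjusted for multiple testing.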
4.3. Gene Set Enrichment Analysis
Is the direction of all genes in a category random?
Note: GSEA is a last resort for biologists, as it does not need any individual gene of interest to be significant
Take Home Messages 4
If you are looking at a multi-factor / multi-treatment experiment, you may check the
variable genes (F-statistics based) first, and then go for the contrasts.
To find the biological meaning of the significantly regulated genes, please use
enrichment analysis methods linking known functional groups of genes to DEA results.
Enriched categories are usually more robust than individual genes. If you have no
significant genes – check gene sets by GSEA.
STRING: https://string-db.org/
WikiPathways: https://wikipathways.org/
Summary
Raw Data
→ QA/QC + Remove outliers
→ Normalization (remove technical artefacts, make data comparable)
→ Filtering (remove uninformative features)
→ Processed Data
Analyses of the processed data:
• Visualization and exploratory analysis (PCA, clustering)
• DEA (differential expression analysis)
• Enrichment (GO, functions, TFs, drugs)
• GSEA (gene set enrichment analysis)
• Network reconstruction (not considered)
• Prediction (signatures for classification)