I´m working with this dataset(GSE234297) that provides a salmon counts file and a salmon offsets file(GSE234297_gene_raw_counts.txt.gz GSE234297_gene_offset_matrix.txt.gz) . The paper authors use them like this to do DEG:~
# create DGEList
y <- DGEList(counts = counts, samples = samples, genes = genes, group = samples$Disease_status)
# add salmon offsets
y <- scaleOffset(y, offset = as.matrix(salmon_offset))
# filter low count genes
keep <- filterByExpr(y, group = y$samples$group)
table(keep)
y <- y[keep, , keep.lib.sizes=FALSE]
# set design matrix
design <- model.matrix(~ Sex + GC_content + group, data = y$samples)
design
# estimate dispersion parameters using edgeR robust method
y <- estimateGLMRobustDisp(y, design)
But my question is if I want to use the salmon offsets for Wilcoxon (which can't use offsets directly), could I simply do:
Copy <- DGEList(counts)
y$offset <- offsets
cpms <- edgeR::cpm(y, offset = y$offset, log = FALSE)
To get properly normalized values that account for the Salmon bias corrections?
Thank you. What is the difference between setting robust=TRUE in glmQLFit and estimateGLMRobustDisp? The paper i mentioned in the post used both:
It is not possible to use both. estimateGLMRobustDisp() estimates negative binomial dispersions for each gene, whereas glmQLFit() estimates quasi-dispersions. estimateGLMRobustDisp() is for the older edgeR glm pipeline based on likelihood ratio tests. If you call the newer pipeline with glmQLFit(), then the gene-specific negative binomial dispersions will be discarded. See Chen et al (2024) for the differences between the pipelines.
How do you know that the paper used robust=TRUE? I see no mention of that in the published paper associated with the GSE234297 dataset.
Chen Y, Chen L, Lun ATL, Baldoni PL, Smyth GK (2024). edgeR 4.0: powerful differential analysis of sequencing data with expanded functionality and improved support for small counts and larger datasets. bioRxiv 2024.01.21.576131. doi:10.1101/2024.01.21.576131
Edit later
I have now remembered that estimateGLMRobustDisp() produces weights as well as NB dispersion, and the weights will definitely affect any quasi-likelihood analysis done downstream. I can't recommend combining estimateGLMRobustDisp with glmQLFit. The edgeR authors didn't intend that the two pipelines be combined in this way.
They provide the code they used for DGE analysis in a gitlab link that can be found in the paper. In there you can find:
Here, estimateGLMRobustDisp() was used just to get the diagnostic BCV plot by plotBCV(y). The robust NB dispersions were not used for the DE analysis.
In edgeR v3, used by the paper, glmQLFit() would have taken the trended NB dispersions from the estimateGLMRobustDisp() output, which would be same trended dispersions as from the standard dispersion estimate command estimateDisp(). In edgeR v4, the NB dispersions are being ignored entirely by glmQLFit() so you can simply skip estimateDisp() and friends, resulting in a considerable saving of computation time, because the NB dispersion steps is computationally expensive.
For a dataset with 100-200 samples, the whole edgeR v4 analysis should take well under a minute.
Edit later
I remember now that estimateGLMRobustDisp() produces weights, which will be used by the glmQLFit analysis. We don't recommend that sequence though.
The estimateGLMRobustDisp() function seems to have an effect on glmQLFit(), even in edgeR 4 (4.4.1). If I use estimateGLMRobustDisp(), like this:
I get 45 differentially expressed genes. However, if I omit estimateGLMRobustDisp():
Also, there's quite a big difference between the results obtained for this dataset with edgeR 3 using the same code as in the paper (245 differentially expressed genes) and using the new QL pipeline in edgeR 4 (0 differentially expressed genes).
Sorry, I had forgotten estimateGLMDisp() produces weights as well as NB dispersions. The weights from estimatGLMRobustDisp() were intended to be used by adjustedProfileLik() to estimate the NB dispersions. However, if you pass the DGE object to glmQLFit(), then glmQLFit() will interpret the weights a generalized linear model weights, with possibly unintended results. The edgeR authors did not intend that estimateGLMRobustDisp() be combined with glmQLFit().
BTW, I am a bit confused by the code in your comment. You have given two code sequences, but they both look to be exactly the same.
I edited the previous post. I meant to remove the estimateGLMRobustDisp line in the second snippet :