Skip to content

SViswanathanLab/snakemake-RNAseq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

92 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

snakemake-RNAseq

This pipeline performs a standard RNAseq analysis, including fastQC, STAR alignment, RSEM & salmon quantification.

Directory structure

.
├── config            # Contains sample sheet (samples.tsv) and config file (config.yaml)
├── rules             # Snakemake rules
├── scripts           # Scripts to run each step of RNAseq
├── README.md
└── Snakefile         # Snakemake workflow

Installation

Option 1: Download the package

  • Choose "Download ZIP"
Screenshot 2024-07-09 at 3 37 12 PM
  • The folder named snakemake-RNAseq-main is downloaded.
  • Transfer the folder to users' working directory on argos.
    scp -r path/to/snakemake-RNAseq-main <USER_ID>@argos-stgw2.dfci.harvard.edu:/mnt/storage/home/<USER_ID>/
    
  • Log onto argos:
    ssh USER_ID@argos.dfci.harvard.edu
    
  • Change the name of the folder to snakemake-RNAseq
    mv $HOME/snakemake-RNAseq-main $HOME/snakemake-RNAseq
    

Option 2: git clone

Clone the pipeline using the following command

git clone https://github.com/SViswanathanLab/snakemake-RNAseq.git

Make sure there is a folder named snakemake-RNAseq in users' working directory.

Usage

Instructions for preparing sample sheet

  • Paired-end data is assumed.

  • 5 types of RNAseq data formats are accommodated: .bam, .fastq.gz, .fq.gz, .fastq, .fq

  • The config/samples.csv file is an example sample sheet, modify it as needed so that

    • 1st column: sample names
    • 2nd column: fq1 file names (if using .bam as input, put the name of .fastq.gz files generated from .bam)
    • 3rd column: fq2 file names (if using .bam as input, put the name of .fastq.gz files generated from .bam)
    • 4th column: RNA-seq file format - choose between bam and fq (all inputs need to be in the same file format)

    Each column is separated by one comma. The fq1 & fq2 file names must contain the full sample names.

    For example:

293T-TFE3-1,293T-TFE3-1_R1.fastq.gz,293T-TFE3-1_R2.fastq.gz,bam
293T-TFE3-2,293T-TFE3-2_R1.fastq.gz,293T-TFE3-2_R2.fastq.gz,bam

Input files

  • The RNAseq files for analysis should be copied to data.
    cp -r path/to/<fq_files_folder> $HOME/snakemake-RNAseq/
    
  • Users should change the name of the folder containing fq files into data.
    mv $HOME/snakemake-RNAseq/<fq_files_folder> $HOME/snakemake-RNAseq/data
    

Run snakemake

  • Step 1: Initiate a screen to run snakemake pipeline in the background without requiring terminal connection all the time

    screen -S snakemake-RNAseq
    
  • Step 2: Change into the directory snakemake-RNAseq

    cd $HOME/snakemake-RNAseq
    
  • Step 3: Activate the environment with snakemake installed & install plugin for cluster submission

    source /mnt/storage/apps/Mambaforge-23.1.0-1/etc/profile.d/conda.sh
    conda activate snakemake
    
    pip install snakemake-executor-plugin-cluster-generic
    
  • Step 4: Run snakemake pipeline

    snakemake --unlock
    snakemake --executor cluster-generic --jobs 50 --latency-wait 60 --cluster-generic-submit-cmd "qsub -l h_vmem=256G, -pe pvm 16 -o $HOME/snakemake-RNAseq/joblogs/ -e $HOME/snakemake-RNAseq/joblogs/"
    
    • This step might take long, depending on the sample sizes.
  • Step 4: Detach the screen as needed

    • Press Ctrl+A and then D to detach the screen as needed. Now closing the terminal or losing connection to the cluster should not interrupt the snakemake pipeline.

    • Use screen -ls to see all detached screens.

    • Example:

      Screenshot 2024-01-08 at 4 26 12 PM

      numbers like 27120 and 26844 are <SCREEN-ID> used to track the processes running in that screen

  • Step 5 (Optional): Reattach to a screen

    screen -r <SCREEN-ID>
    
    • If the command execution in the screen is interrupted, users need to rerun Step 4 in the screen to generate all results expected.
    • Press Ctrl+A and then D to detach the screen as needed.
  • Step 6: Delete the screen after done

    screen -S <SCREEN-ID> -X quit
    

Output

  • The results are saved in the folder snakemake-RNAseq/results.
    • snakemake-RNAseq/results/fastqc_results contains the fastqc results.
    • snakemake-RNAseq/results/STAR_results contains the STAR results, and each subfolder is named by the sample name.
    • snakemake-RNAseq/results/salmon_results contains the salmon results, and each subfolder is named by the sample name.
    • snakemake-RNAseq/results/star_wide_countMatrix.csv and snakemake-RNAseq/results/star_wide_countMatrix.Rds contain the star count matrix, with genes as rows and samples as columns, in both .csv and .Rds format.
    • snakemake-RNAseq/results/salmon_wide_TPM_Matrix.csv and snakemake-RNAseq/results/salmon_wide_TPM_Matrix.Rds contain the salmon TPM matrix, with genes as rows and samples as columns, in both .csv and .Rds format.
    • snakemake-RNAseq/results/rsem_geneLevel_wide_TPM_Matrix.csv and snakemake-RNAseq/results/rsem_geneLevel_wide_TPM_Matrix.Rds contain the rsem gene-level TPM matrix, with genes as rows and samples as columns, in both .csv and .Rds format.
    • snakemake-RNAseq/results/rsem_isoformLevel_wide_TPM_Matrix.csv and snakemake-RNAseq/results/rsem_isoformLevel_wide_TPM_Matrix.Rds contain the rsem transcript-level TPM matrix, with genes as rows and samples as columns, in both .csv and .Rds format.
  • snakemake-RNAseq/rsem_ref contains the reference files generated for rsem quantification.
  • snakemake-RNAseq/logs contains the log files for running each step of this analysis, for debugging.
  • snakemake-RNAseq/joblogs contains the log files for job submission, for debugging.

config

config.yaml contains the information about versions of each tool used, reference file paths

  • Module versions (latest ones globally installed on argos):
    • samtools: 1.9
    • fastqc: 0.11.7
    • star: 2.7.10a
    • rsem: 1.3.1
    • salmon: 1.10.1
    • snakemake: 8.15.2
    • snakemake-executor-plugin-cluster-generic: 1.0.9
  • Reference file directory: /mnt/storage/labs/sviswanathan/snakemake_RNAseq_2024/Human_genome_2024/
    • Can be modified as needed

About

Standard RNAseq analysis pipeline using snakemake.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  
pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy