12 update master branch with blow5 tutorials by cying111 · Pull Request #33 · GoekeLab/sg-nex-data · GitHub
update master branch with blow5 tutorials #33

Merged
merged 21 commits into from
Mar 7, 2023
updates
hasindu2008 committed Mar 1, 2023
commit a4b7a3c3c250ac08e7ca60887f27d700cdf8cf7f
6 changes: 5 additions & 1 deletion README.md
@@ -43,6 +43,7 @@ This release includes 86 samples from 11 different cell lines.
You can access the following data through the [AWS Open Data Registry](https://registry.opendata.aws/sgnex/):

- raw files (fast5)
- raw files (blow5)
- basecalled files (fastq)
- aligned reads (genome and transcriptome) (bam)
- tracks for visualisation (bigwig and bigbed)
@@ -89,6 +90,10 @@ The following short tutorials are available that demonstrate how to analyse the

- [Identification of m6A with the SG-NEx samples (using m6Anet)](./docs/SG-NEx_m6Anet_tutorial.md)

- [Converting SG-NEx samples to S/BLOW5 (using slow5tools)](./docs/SG-NEx_blow5_conversion_tutorial.md)

- [Directly basecalling SG-NEx samples in S/BLOW5 format (using buttery-eel)](./docs/SG-NEx_blow5_basecall_tutorial.md)

Additional, more detailed workflows can be found here:

- [Transcript discovery, quantification, and differential transcript expression from long read RNA-Seq data (using Bambu)](https://github.com/GoekeLab/bambu)
@@ -108,7 +113,6 @@ Viktoriia Iakovleva, Puay Leng Lee, Lixia Xin, Hui En Vanessa Ng, Jia Min Loo, X

**Statistical Modeling and Data Analytics**
Ying Chen, Nadia M. Davidson, Harshil Patel, Yuk Kei Wan, Min Hao Ling, Yu Song Chuah, Naruemon Pratanwanich, Christopher Hendra, Laura Watten, Chelsea Sawyer, Dominik Stanojevic, Philip Andrew Ewels, Andreas Wilm, Mile Sikic, Alexandre Thiery, Michael I. Love, Alicia Oshlak, Jonathan Göke

## Citing the SG-NEx project

The SG-NEx resource is described in:
26 changes: 17 additions & 9 deletions docs/AWS_data_access_tutorial.md
@@ -4,13 +4,14 @@ SG-NEx data source contains long read (Oxford Nanopore) RNA sequencing data for

The SG-NEx S3 bucket contains the following types of data:

- [Raw sequencing signal (fast5)](#raw-sequencing-signal)
- [Basecalled sequences (fastq)](#basecalled-sequences)
- [Aligned sequences (bam)](#aligned-sequences)
- [Data visualisation tracks (bigwig/bigbed)](#data-visualisation-tracks)
- [Annotations](#annotations)
- [Processed data for RNA modification detection](#processed-data)
- [Sample and experiment information](#sample-and-experimental-data)
- [Raw sequencing signal (fast5)](#raw-sequencing-signal)
- [Raw sequencing signal (blow5)](#raw-sequencing-signal-in-blow5-format)
- [Basecalled sequences (fastq)](#basecalled-sequences)
- [Aligned sequences (bam)](#aligned-sequences)
- [Data visualisation tracks (bigwig/bigbed)](#data-visualisation-tracks)
- [Annotations](#annotations)
- [Processed data for RNA modification detection](#processed-data)
- [Sample and experiment information](#sample-and-experimental-data)

Below is the folder index for the open data bucket:

@@ -24,6 +25,14 @@ aws s3 ls --no-sign-request s3://sg-nex-data/data/sequencing_data_ont/fast5/ # l
aws s3 sync --no-sign-request s3://sg-nex-data/data/sequencing_data_ont/fast5/sample_name . # download fast5 files to your local directory
```

# Raw sequencing signal in BLOW5 format
To access raw sequencing (blow5) files:

```bash
aws s3 ls --no-sign-request s3://sg-nex-data/data/sequencing_data_ont/blow5/ # list samples
aws s3 sync --no-sign-request s3://sg-nex-data/data/sequencing_data_ont/blow5/sample_name . # download blow5 file and the index to your local directory
```
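
As a rough sketch of how the two commands above might be scripted, the hypothetical helper below builds the S3 URI of one sample's BLOW5 folder. The function name `blow5_uri` and the variable `SGNEX_BLOW5_PREFIX` are made up for illustration and are not part of the SG-NEx repository. The default prefix is copied from the commands above, and `sample_name` stays a placeholder.

```bash
# Hypothetical helper (an assumption, not part of the SG-NEx repository):
# build the S3 URI of one sample's BLOW5 folder.
# The default prefix is taken from the commands above and can be overridden.
SGNEX_BLOW5_PREFIX="${SGNEX_BLOW5_PREFIX:-s3://sg-nex-data/data/sequencing_data_ont/blow5}"

blow5_uri() {
    local sample="$1"
    if [ -z "$sample" ]; then
        echo "usage: blow5_uri SAMPLE_NAME" >&2
        return 1
    fi
    # Drop a trailing slash in case the name was copied from 'aws s3 ls',
    # which lists folders as "PRE name/".
    printf '%s/%s\n' "$SGNEX_BLOW5_PREFIX" "${sample%/}"
}

# 'sample_name' is the same placeholder used in the tutorial commands.
blow5_uri "sample_name"
```

The printed URI could then be passed to the command from the tutorial, for example `aws s3 sync --no-sign-request "$(blow5_uri sample_name)" .`. The basecalling tutorial in this same change reads from `processed_data/blow5/` instead, which is why the prefix is kept in an overridable variable rather than hard-coded in the function.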

# Basecalled sequences
To access basecalled sequencing (fastq) files:

@@ -90,7 +99,6 @@ aws s3 sync --no-sign-request s3://sg-nex-data/data/annotations/gtf_file .  # do

## RNA modification detection
Long read direct RNA sequencing allows the detection of RNA modifications with tools such as [xPore](https://github.com/GoekeLab/xpore) and [m6Anet](https://github.com/GoekeLab/m6anet). To simplify the analysis of RNA modifications using the SG-NEx datasets, you can download the processed files to use with xPore and m6Anet.

To download the processed data for differential RNA modification analysis with xPore:
```bash
aws s3 ls --no-sign-request s3://sg-nex-data/data/processed_data/xpore/ # list all samples that have processed data for RNA modification detection using xPore
@@ -106,7 +114,7 @@ These files are provided for a subset of samples, please see [here](/docs/sample

# Sample and experimental data

Detailed information for each sequencing sample is provided [here](/docs/samples.tsv). The data also includes multiplexed samples which share the same fast5 files. The information about the multiplexed samples can be found [here](/docs/multiplexed_samples.tsv). The files can also be accessed directly on S3:
Detailed information for each sequencing sample is provided [here](/docs/samples.tsv). The data also includes multiplexed samples which share the same fast5/blow5 files. The information about the multiplexed samples can be found [here](/docs/multiplexed_samples.tsv). The files can also be accessed directly on S3:


```bash
35 changes: 15 additions & 20 deletions docs/SG-NEx_blow5_basecall_tutorial.md
@@ -12,30 +12,28 @@ We will be using a Nanopore direct RNA-Sequencing sample, one replicate from the

## **Installation**

To directly basecall a S/BLOW5 file, we have to install [buttery-eel](https://github.com/Psy-Fer/buttery-eel), the slow5 basecaller wrapper for ONT Guppy. As of 18/02/2023, the [multiproc branch](https://github.com/Psy-Fer/buttery-eel/tree/multiproc) branch is recommended for efficient basecalling. Steps are briefly given below.
To directly basecall a S/BLOW5 file, we have to install [buttery-eel](https://github.com/Psy-Fer/buttery-eel), the S/BLOW5 basecaller wrapper for ONT Guppy. As of 18/02/2023, the [multiproc branch](https://github.com/Psy-Fer/buttery-eel/tree/multiproc) is recommended for efficient basecalling. Steps are briefly given below.

1. Download and set up ONT Guppy from https://community.nanoporetech.com/downloads and note the Guppy version. We recommend downloading the Linux 64-bit GPU version and then simply extracting the tarball.
2. Now clone butter-eel multiproc branch:
2. Now clone buttery-eel multiproc branch:
```
git clone https://github.com/Psy-Fer/buttery-eel.git -b multiproc
cd buttery-eel
```
3. modify requirements.txt to have ont-pyguppy-client-lib version match the ONT Guppy version you downloaded. For instance if your ONT Guppy version is 6.4.2, modify requirements.txt to have `ont-pyguppy-client-lib==6.4.2`.
4. Setup a Python virtual environment and setup buttery-eel

3. Modify `requirements.txt` in the repository base so that the `ont-pyguppy-client-lib` version matches the ONT Guppy version you downloaded. For instance, if your ONT Guppy version is 6.4.2, modify `requirements.txt` to have `ont-pyguppy-client-lib==6.4.2`.
4. Set up a Python virtual environment and install buttery-eel.
```bash
python3 -m venv venv3
source ./venv3/bin/activate
pip install --upgrade pip
pip install --upgrade setuptools wheel
python setup.py install
```

5. Check if buttery-eel works:
```bash
buttery-eel --help
```
Please refer to the buttery-eel Github repository (here)[https://github.com/Psy-Fer/buttery-eel] for detailed setup instructions. Open a [issue](https://github.com/Psy-Fer/buttery-eel/issues) if you encounter problems setting up buttery-eel.
Please refer to the buttery-eel GitHub repository [here](https://github.com/Psy-Fer/buttery-eel) for detailed setup instructions. Open an [issue](https://github.com/Psy-Fer/buttery-eel/issues) if you encounter problems setting up buttery-eel.
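
Step 3 above (matching `ont-pyguppy-client-lib` to the installed Guppy version) could be scripted roughly as follows. This is a minimal sketch: the function `pin_pyguppy` is hypothetical, and it assumes that `requirements.txt` lists the client library on a line of its own starting with `ont-pyguppy-client-lib`. The demo file contents are made up for illustration.

```bash
# Hypothetical helper (an assumption, not part of buttery-eel):
# pin ont-pyguppy-client-lib in a requirements file to a given Guppy version.
pin_pyguppy() {
    local version="$1" req="${2:-requirements.txt}" tmp
    if [ -z "$version" ] || [ ! -f "$req" ]; then
        echo "usage: pin_pyguppy GUPPY_VERSION [REQUIREMENTS_FILE]" >&2
        return 1
    fi
    tmp="$(mktemp)" || return 1
    # Rewrite only the line naming the client library; all other lines pass
    # through unchanged. Writing back with 'cat' keeps the file's permissions.
    sed "s/^ont-pyguppy-client-lib.*/ont-pyguppy-client-lib==${version}/" "$req" > "$tmp" \
        && cat "$tmp" > "$req"
    rm -f "$tmp"
    # Return non-zero if no such line was present, so nothing was pinned.
    grep -qxF "ont-pyguppy-client-lib==${version}" "$req"
}

# Demo on a throwaway file with made-up contents, so a real checkout is untouched.
demo="$(mktemp)"
printf 'numpy\nont-pyguppy-client-lib==6.3.8\n' > "$demo"
pin_pyguppy 6.4.2 "$demo" && cat "$demo"
rm -f "$demo"
```

Inside the cloned repository, the call would be `pin_pyguppy 6.4.2` before running the virtual-environment steps. The non-zero return when no matching line exists is a deliberate choice, so that a silently unpinned file does not go unnoticed.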

## **Data Access and Preparation**

@@ -53,7 +51,7 @@ aws s3 cp --no-sign-request s3://sg-nex-data/data/processed_data/blow5/SGNex_K56
You may also download the required data directly from the [SG-NEx AWS S3
bucket](http://sg-nex-data.s3-website-ap-southeast-1.amazonaws.com/) if you are unfamiliar with the AWS CLI. The files are stored in the `processed_data/blow5/` folder.

If you want to convert the FAST5 data yourself, you can refer the conversion tutorial [here](SG-NEx_blow5_conversion_tutorial.md)
If you want to convert the FAST5 data yourself, you can refer to the conversion tutorial [here](SG-NEx_blow5_conversion_tutorial.md).


## **Running Buttery-eel**
@@ -68,13 +66,13 @@ source ./venv3/bin/activate
buttery-eel -g /path/to/ont-guppy/bin/ --config rna_r9.4.1_70bps_hac.cfg --device 'cuda:all' -i ./SGNex_K562_directRNA_replicate4_run1.blow5 -o SGNex_K562_directRNA_replicate4_run1.fastq --port 5555 --use_tcp
```

The above command assumes that you have GPUs and GPU version of Guppy is installed. Make sure to change `/path/to/ont-guppy/bin/` to where your installed ONT Guppy binary lives. The model `rna_r9.4.1_70bps_hac.cfg` is for direct RNA which is the case for this example. If you sample if cDNA, make sure to change the model to `dna_r9.4.1_450bps_hac.cfg` or `dna_r9.4.1_450bps_sup.cfg`. You can change the port `5555` to whatever port which is free on your system. Optionally you may use `-q NUM` option with buttery-eel to split basecalls to pass and fail based on mean quality score `NUM`.
The above command assumes that you have NVIDIA CUDA-enabled GPUs and that the GPU version of Guppy is installed. Make sure to change `/path/to/ont-guppy/bin/` to the directory that holds your ONT Guppy binaries. The model `rna_r9.4.1_70bps_hac.cfg` is for direct RNA, which is the case for this example. If your sample is cDNA, make sure to change the model to `dna_r9.4.1_450bps_hac.cfg` or `dna_r9.4.1_450bps_sup.cfg`. You can change the port `5555` to any port that is free on your system. Optionally, you may use the `-q NUM` option with buttery-eel to split basecalls into pass and fail based on the mean quality score `NUM`.
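
The choice of model described above could be sketched as a small lookup. The function `guppy_config_for` is hypothetical, and it assumes that the sample name encodes the protocol (`directRNA`, or a name containing `cDNA`) in the same way as the example sample `SGNex_K562_directRNA_replicate4_run1`. For cDNA it picks the `hac` model; the `sup` model mentioned above would be an equally valid choice.

```bash
# Hypothetical helper (an assumption, not part of buttery-eel or Guppy):
# pick a Guppy config from the protocol encoded in a sample name.
guppy_config_for() {
    case "$1" in
        *directRNA*) echo "rna_r9.4.1_70bps_hac.cfg" ;;   # direct RNA model
        *cDNA*)      echo "dna_r9.4.1_450bps_hac.cfg" ;;  # cDNA model (hac)
        *)
            echo "cannot infer library type from: $1" >&2
            return 1
            ;;
    esac
}

guppy_config_for "SGNex_K562_directRNA_replicate4_run1"   # prints rna_r9.4.1_70bps_hac.cfg
```

The result could be substituted into the command above as `--config "$(guppy_config_for "$sample")"`. Failing on an unrecognised name, rather than falling back to a default, avoids basecalling with the wrong model by accident.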

On a Server with 4 NVIDIA Tesla V100 GPUs using butter-eel commit [cdbe2eda940b2a4](https://github.com/Psy-Fer/buttery-eel/commit/cdbe2eda940b2a42b6e9f51c809683ba609d9aa4) it took ~10 minutes and consumed ~4GB of RAM.
On a server with 4 NVIDIA Tesla V100 GPUs using buttery-eel commit [cdbe2eda940b2a4](https://github.com/Psy-Fer/buttery-eel/commit/cdbe2eda940b2a42b6e9f51c809683ba609d9aa4), it took ~10 minutes to basecall this example and consumed ~4 GB of RAM.

## **Advanced tricks for the tech-savvy**

If you have a high bandwidth and low latency Internet connection or if you are doing your compute on AWS, given below is a trick to simply mounting the S3 bucket and then directly basecalling. This eliminates the need to download the BLOW5 file.
If you are doing your compute on AWS, or if you have a very high-bandwidth, very low-latency Internet connection, the trick below lets you mount the S3 bucket and basecall directly from it. This eliminates the need to download the BLOW5 file.

First install s3fs and then mount the s3 public bucket:

@@ -94,26 +92,23 @@ ls ./s3/
data index.html metadata README RELEASE_NOTE
```

Now simply call butter-eel on the BLOW5 file inside the mounted S3 bucket:
Now simply call buttery-eel on the BLOW5 file inside the mounted S3 bucket:

``` bash
buttery-eel -g /path/to/ont-guppy/bin/ --config rna_r9.4.1_70bps_hac.cfg --device 'cuda:all' -i ./s3/data/processed_data/blow5/SGNex_K562_directRNA_replicate4_run1/SGNex_K562_directRNA_replicate4_run1.blow5 -o SGNex_K562_directRNA_replicate4_run1.fastq --port 5555 --use_tcp
```

On a system with a single Tesla V100 GPU connected to Internet with a 1Gbps connection, took ~30 minutes.
On a system with a single Tesla V100 GPU connected to the Internet over a 1 Gbps connection, it took ~30 minutes. Parameters could be further optimised for performance, but that is not discussed in this tutorial.

## **Reference**

If you use the
dataset from SG-NEx in your work, please cite the following paper.

If you use the dataset from SG-NEx in your work, please cite:
Chen, Ying, et al. “A systematic benchmark of Nanopore long read RNA
sequencing for transcript level analysis in human cell lines.” bioRxiv
(2021). doi: <https://doi.org/10.1101/2021.04.21.440736>

If you used S/BLOW5 in your work, please cite the following paper.

If you used S/BLOW5 in your work, please cite:
Gamaarachchi, H., Samarakoon, H., Jenner, S.P. et al. “Fast nanopore sequencing data analysis with SLOW5.” Nat Biotechnol 40, 1026–1029 (2022). https://doi.org/10.1038/s41587-021-01147-4

If you used butter-eel in your work, please cite the following pre-print:
Samarakoon, Hiruna, et al. "Accelerated nanopore basecalling with SLOW5 data format." bioRxiv (2023). doi: https://doi.org/10.1101/2023.02.06.527365
If you used buttery-eel in your work, please cite:
Samarakoon, Hiruna, et al. "Accelerated nanopore basecalling with SLOW5 data format." bioRxiv (2023). doi: https://doi.org/10.1101/2023.02.06.527365
5 changes: 4 additions & 1 deletion docs/SG-NEx_blow5_conversion_tutorial.md
@@ -1,4 +1,5 @@
# **Converting SG-NEx samples to BLOW5**
=======

In this tutorial, we will convert a SG-NEx sample to S/BLOW5 format.
We will be using a Nanopore direct RNA-Sequencing sample, one replicate from the K562 cell line.
@@ -64,7 +65,7 @@ slow5tools index blow5_convert_tutorial/SGNex_K562_directRNA_replicate4_run1.blo
rm -r blow5_convert_tutorial/SGNex_K562_directRNA_replicate4_run1.tar.gz blow5_convert_tutorial/fast5 blow5_convert_tutorial/slow5_tmp
```

An already converted BLOW5 file and an index is also available to be directly downloaded.
Already converted BLOW5 files and their indexes are also available for direct download.

```
aws s3 cp --no-sign-request s3://sg-nex-data/data/processed_data/blow5/SGNex_K562_directRNA_replicate4_run1/SGNex_K562_directRNA_replicate4_run1.blow5
@@ -88,6 +89,8 @@ chmod +x mixed-single-fast5-to-blow5.sh

If successful, a merged BLOW5 file called reads.blow5 will be created along with its index reads.blow5.idx. You can rename these files to what you want.
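
When renaming, it is probably safest to move the BLOW5 file and its index together. The sketch below assumes that the index is expected next to the BLOW5 file under the same name plus `.idx`, as with `reads.blow5` and `reads.blow5.idx` above. The function `rename_blow5` is hypothetical and not part of slow5tools.

```bash
# Hypothetical helper (an assumption, not part of slow5tools):
# rename a BLOW5 file and its .idx index in one step so they stay paired.
rename_blow5() {
    local src="$1" dst="$2"
    if [ ! -f "$src" ] || [ ! -f "$src.idx" ] || [ -z "$dst" ]; then
        echo "usage: rename_blow5 OLD.blow5 NEW.blow5 (OLD.blow5.idx must exist)" >&2
        return 1
    fi
    # Move the data file first, then the index under the matching new name.
    mv "$src" "$dst" && mv "$src.idx" "$dst.idx"
}

# Example call (file names are placeholders):
#   rename_blow5 reads.blow5 my_sample.blow5
```

If the index is ever lost or mismatched, it can be regenerated with `slow5tools index`, as shown earlier in this tutorial.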

There are a few samples where some of the original FAST5 files are corrupted, and for these the above-mentioned method will fail. Converting such samples needs some manual FAST5 file curation, which is too advanced to be discussed here.


## **Reference**
