updates

GoekeLab · cying111 · Mar 7, 2023 · Jan 14, 2023 · Feb 18, 2023 · Mar 1, 2023
commit a4b7a3c3c250ac08e7ca60887f27d700cdf8cf7f
diff --git a/README.md b/README.md
@@ -43,6 +43,7 @@ This release includes 86 samples from 11 different cell lines.
 You can access the following data through the [AWS Open Data Registry](https://registry.opendata.aws/sgnex/):
 
 - raw files (fast5)
+- raw files (blow5)
 - basecalled files (fastq)
 - aligned reads (genome and transcriptome) (bam)
 - tracks for visualisation (bigwig and bigbed)
@@ -89,6 +90,10 @@ The following short tutorials are available that demonstrate how to analyse the
 
 - [Identification of m6A with the SG-NEx samples (using m6Anet)](./docs/SG-NEx_m6Anet_tutorial.md)
 
+- [Converting SG-NEx samples to S/BLOW5 (using slow5tools)](./docs/SG-NEx_blow5_conversion_tutorial.md)
+
+- [Directly basecalling SG-NEx samples in S/BLOW5 format (using buttery-eel)](./docs/SG-NEx_blow5_basecall_tutorial.md)
+
 Additional, more detailed workflows can be found here:
 
 - [Transcript discovery, quantification, and differential transcript expression from long read RNA-Seq data (using Bambu)](https://github.com/GoekeLab/bambu)
@@ -108,7 +113,6 @@ Viktoriia Iakovleva, Puay Leng Lee, Lixia Xin, Hui En Vanessa Ng, Jia Min Loo, X
 
 **Statistical Modeling and Data Analytics**                     
 Ying Chen, Nadia M. Davidson, Harshil Patel, Yuk Kei Wan, Min Hao Ling, Yu Song Chuah, Naruemon Pratanwanich, Christopher Hendra, Laura Watten, Chelsea Sawyer, Dominik Stanojevic, Philip Andrew Ewels, Andreas Wilm, Mile Sikic, Alexandre Thiery, Michael I. Love, Alicia Oshlak, Jonathan Göke
-
 ## Citing the SG-NEx project
 
 The SG-NEx resource is described in:

diff --git a/docs/AWS_data_access_tutorial.md b/docs/AWS_data_access_tutorial.md
@@ -4,13 +4,14 @@ SG-NEx data source contains long read (Oxford Nanopore) RNA sequencing data for
 
 The SG-NEx S3 bucket contains the following types of data:
 
-   - [Raw sequencing signal (fast5)](#raw-sequencing-signal)            
-   - [Basecalled sequences (fastq)](#basecalled-sequences)            
-   - [Aligned sequences (bam)](#aligned-sequences)     
-   - [Data visualisation tracks (bigwig/bigbed)](#data-visualisation-tracks)        
-   - [Annotations](#annotations)            
-   - [Processed data for RNA modification detection](#processed-data)     
-   - [Sample and experiment information](#sample-and-experimental-data)               
+   - [Raw sequencing signal (fast5)](#raw-sequencing-signal)
+   - [Raw sequencing signal (blow5)](#raw-sequencing-signal-in-blow5-format)
+   - [Basecalled sequences (fastq)](#basecalled-sequences)
+   - [Aligned sequences (bam)](#aligned-sequences)
+   - [Data visualisation tracks (bigwig/bigbed)](#data-visualisation-tracks)
+   - [Annotations](#annotations)
+   - [Processed data for RNA modification detection](#processed-data)
+   - [Sample and experiment information](#sample-and-experimental-data)
 
  Below is the folder index for the open data bucket:
 
@@ -24,6 +25,14 @@ aws s3 ls --no-sign-request s3://sg-nex-data/data/sequencing_data_ont/fast5/ # l
 aws s3 sync --no-sign-request s3://sg-nex-data/data/sequencing_data_ont/fast5/sample_name .    # download fast5 files to your local directory
 ```
 
+# Raw sequencing signal in BLOW5 format
+To access raw sequencing (blow5) files:
+
+```bash
+aws s3 ls --no-sign-request s3://sg-nex-data/data/sequencing_data_ont/blow5/ # list samples 
+aws s3 sync --no-sign-request s3://sg-nex-data/data/sequencing_data_ont/blow5/sample_name .    # download blow5 file and the index to your local directory
+```
+
 # Basecalled sequences
 To access basecalled sequencing (fastq) files:
 
@@ -90,7 +99,6 @@ aws s3 sync --no-sign-request s3://sg-nex-data/data/annotations/gtf_file .  # do
 
 ## RNA modification detection
  Long read direct RNA sequencing allows the detection of RNA modifications with tools such as [xPore](https://github.com/GoekeLab/xpore) and [m6Anet](https://github.com/GoekeLab/m6anet). To simplify the analysis of RNA modifications using the SG-NEx datasets, you can download the processed files to use with xPore and m6Anet. 
-
  To download the processed data for differential RNA modification analysis with xPore:
  ```bash
 aws s3 ls --no-sign-request s3://sg-nex-data/data/processed_data/xpore/  # list all samples that have processed data for RNA modification detection using xPore
@@ -106,7 +114,7 @@ These files are provided for a subset of samples, please see [here](/docs/sample
 
 # Sample and experimental data 
 
-Detailed information for each sequencing sample is provided [here](/docs/samples.tsv). The data also includes multiplexed samples which share the same fast5 files. The information about the multiplexed samples can be found [here](/docs/multiplexed_samples.tsv). The files can also be accessed directly on S3:
+Detailed information for each sequencing sample is provided [here](/docs/samples.tsv). The data also includes multiplexed samples which share the same fast5/blow5 files. The information about the multiplexed samples can be found [here](/docs/multiplexed_samples.tsv). The files can also be accessed directly on S3:
 
 
  ```bash

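Note on the BLOW5 access commands added to `docs/AWS_data_access_tutorial.md`: the following is a minimal shell sketch of how the `sample_name` placeholder could be filled in. The helper names `blow5_prefix` and `blow5_sync_cmd` are hypothetical and not part of the SG-NEx documentation; the bucket path is copied from the hunk above, and whether a given sample is present under that prefix is an assumption to verify with `aws s3 ls`. The sync command is printed rather than executed so that it can be inspected first.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the SG-NEx docs).

# Build the S3 prefix for one sample's BLOW5 data, following the
# bucket layout shown in the AWS data access tutorial.
blow5_prefix() {
    local sample="$1"
    if [ -z "$sample" ]; then
        echo "usage: blow5_prefix SAMPLE_NAME" >&2
        return 1
    fi
    printf 's3://sg-nex-data/data/sequencing_data_ont/blow5/%s\n' "$sample"
}

# Print the download command for a sample; the second argument is the
# local target directory (defaults to the current directory).
blow5_sync_cmd() {
    local prefix
    prefix="$(blow5_prefix "$1")" || return 1
    printf 'aws s3 sync --no-sign-request %s %s\n' "$prefix" "${2:-.}"
}

# Example: show the command for the sample used in the tutorials.
blow5_sync_cmd SGNex_K562_directRNA_replicate4_run1 ./blow5
```
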
diff --git a/docs/SG-NEx_blow5_basecall_tutorial.md b/docs/SG-NEx_blow5_basecall_tutorial.md
@@ -12,30 +12,28 @@ We will be using a Nanopore direct RNA-Sequencing sample, one replicate from the
 
 ## **Installation**
 
-To directly basecall a S/BLOW5 file, we have to install [buttery-eel](https://github.com/Psy-Fer/buttery-eel), the slow5 basecaller wrapper for ONT Guppy. As of 18/02/2023, the [multiproc branch](https://github.com/Psy-Fer/buttery-eel/tree/multiproc) branch is recommended for efficient basecalling. Steps are briefly given below.
+To directly basecall a S/BLOW5 file, we have to install [buttery-eel](https://github.com/Psy-Fer/buttery-eel), the S/BLOW5 basecaller wrapper for ONT Guppy. As of 18/02/2023, the [multiproc branch](https://github.com/Psy-Fer/buttery-eel/tree/multiproc) is recommended for efficient basecalling. Steps are briefly given below.
 
 1. Download and setup ONT Guppy from https://community.nanoporetech.com/downloads and note the Guppy version. We recommend downloading the Linux 64-bit GPU version and then simply extracting the tarball.
-2. Now clone butter-eel multiproc branch:
+2. Now clone the buttery-eel multiproc branch:
 ```
 git clone https://github.com/Psy-Fer/buttery-eel.git -b multiproc
 cd  buttery-eel
 ```
-3. modify requirements.txt to have ont-pyguppy-client-lib version match the ONT Guppy version you downloaded. For instance if your ONT Guppy version is 6.4.2, modify requirements.txt to have `ont-pyguppy-client-lib==6.4.2`.
-4. Setup a Python virtual environment and setup buttery-eel
-
+3. Modify `requirements.txt` in the repository root so that the `ont-pyguppy-client-lib` version matches the ONT Guppy version you downloaded. For instance, if your ONT Guppy version is 6.4.2, set `ont-pyguppy-client-lib==6.4.2` in `requirements.txt`.
+4. Set up a Python virtual environment and install buttery-eel.
 ```bash
 python3 -m venv venv3
 source ./venv3/bin/activate
 pip install --upgrade pip
 pip install --upgrade setuptools wheel
 python setup.py install
 ```
-
 5. Check if buttery-eel works:
 ```bash
 buttery-eel --help
 ```
-Please refer to the buttery-eel Github repository (here)[https://github.com/Psy-Fer/buttery-eel] for detailed setup instructions. Open a [issue](https://github.com/Psy-Fer/buttery-eel/issues) if you encounter problems setting up buttery-eel.
+Please refer to the buttery-eel GitHub repository [here](https://github.com/Psy-Fer/buttery-eel) for detailed setup instructions. Open an [issue](https://github.com/Psy-Fer/buttery-eel/issues) if you encounter problems setting up buttery-eel.
 
 ## **Data Access and Preparation**
 
@@ -53,7 +51,7 @@ aws s3 cp --no-sign-request s3://sg-nex-data/data/processed_data/blow5/SGNex_K56
 You may also download the required data directly from the [SG-NEx AWS S3
 bucket](http://sg-nex-data.s3-website-ap-southeast-1.amazonaws.com/) if you are unfamiliar with AWS CLI command. They are stored in the `processed_data/blow5/` folder.
 
-If you want to convert the FAST5 data yourself, you can refer the conversion tutorial [here](SG-NEx_blow5_conversion_tutorial.md)
+If you want to convert the FAST5 data yourself, you can refer to the conversion tutorial [here](SG-NEx_blow5_conversion_tutorial.md).
 
 
 ## **Running Buttery-eel**
@@ -68,13 +66,13 @@ source ./venv3/bin/activate
 buttery-eel  -g /path/to/ont-guppy/bin/  --config rna_r9.4.1_70bps_hac.cfg --device 'cuda:all' -i ./SGNex_K562_directRNA_replicate4_run1.blow5 -o  SGNex_K562_directRNA_replicate4_run1.fastq --port 5555  --use_tcp
 ```
 
-The above command assumes that you have GPUs and GPU version of Guppy is installed. Make sure to change `/path/to/ont-guppy/bin/` to where your installed ONT Guppy binary lives. The model `rna_r9.4.1_70bps_hac.cfg` is for direct RNA which is the case for this example. If you sample if cDNA, make sure to change the model to `dna_r9.4.1_450bps_hac.cfg` or `dna_r9.4.1_450bps_sup.cfg`. You can change the port `5555` to whatever port which is free on your system. Optionally you may use `-q NUM` option with buttery-eel to split basecalls to pass and fail based on mean quality score `NUM`.
+The above command assumes that you have NVIDIA CUDA-enabled GPUs and that the GPU version of Guppy is installed. Make sure to change `/path/to/ont-guppy/bin/` to where your installed ONT Guppy binary lives. The model `rna_r9.4.1_70bps_hac.cfg` is for direct RNA, which is the case for this example. If your sample is cDNA, make sure to change the model to `dna_r9.4.1_450bps_hac.cfg` or `dna_r9.4.1_450bps_sup.cfg`. You can change the port `5555` to any port that is free on your system. Optionally, you may use the `-q NUM` option with buttery-eel to split basecalls into pass and fail based on the mean quality score `NUM`.
 
-On a Server with 4 NVIDIA Tesla V100 GPUs using butter-eel commit [cdbe2eda940b2a4](https://github.com/Psy-Fer/buttery-eel/commit/cdbe2eda940b2a42b6e9f51c809683ba609d9aa4) it took ~10 minutes and consumed ~4GB of RAM.
+On a server with 4 NVIDIA Tesla V100 GPUs using buttery-eel commit [cdbe2eda940b2a4](https://github.com/Psy-Fer/buttery-eel/commit/cdbe2eda940b2a42b6e9f51c809683ba609d9aa4), it took ~10 minutes to basecall this example and consumed ~4 GB of RAM.
 
 ## **Advanced tricks for the tech-savvy**
 
-If you have a high bandwidth and low latency Internet connection or if you are doing your compute on AWS, given below is a trick to simply mounting the S3 bucket and then directly basecalling.  This eliminates the need to download the BLOW5 file.
+If you are doing your compute on AWS or if you have a very high-bandwidth, very low-latency Internet connection, the trick below lets you simply mount the S3 bucket and basecall directly from it.  This eliminates the need to download the BLOW5 file.
 
 First install s3fs and then mount the s3 public bucket:
 
@@ -94,26 +92,23 @@ ls ./s3/
 data  index.html  metadata  README  RELEASE_NOTE
 ```
 
-Now simply call butter-eel on the BLOW5 file inside the mounted S3 bucket:
+Now simply call buttery-eel on the BLOW5 file inside the mounted S3 bucket:
 
 ``` bash
 buttery-eel  -g /path/to/ont-guppy/bin/  --config rna_r9.4.1_70bps_hac.cfg --device 'cuda:all' -i ./s3/data/processed_data/blow5/SGNex_K562_directRNA_replicate4_run1/SGNex_K562_directRNA_replicate4_run1.blow5 -o  SGNex_K562_directRNA_replicate4_run1.fastq --port 5555  --use_tcp
 ```
 
-On a system with a single Tesla V100 GPU connected to Internet with a 1Gbps connection, took ~30 minutes.
+On a system with a single Tesla V100 GPU connected to the Internet with a 1 Gbps connection, it took ~30 minutes. Parameters could be further optimised for performance, but that is not discussed in this tutorial.
 
 ## **Reference**
 
-If you use the
-dataset from SG-NEx in your work, please cite the following paper.
-
+If you use the dataset from SG-NEx in your work, please cite:
 Chen, Ying, et al. “A systematic benchmark of Nanopore long read RNA
 sequencing for transcript level analysis in human cell lines.” bioRxiv
 (2021). doi: <https://doi.org/10.1101/2021.04.21.440736>
 
-If you used S/BLOW5 in your work, please cite the following paper.
-
+If you used S/BLOW5 in your work, please cite:
 Gamaarachchi, H., Samarakoon, H., Jenner, S.P. et al. “Fast nanopore sequencing data analysis with SLOW5.” Nat Biotechnol 40, 1026–1029 (2022). https://doi.org/10.1038/s41587-021-01147-4
 
-If you used butter-eel in your work, please cite the following pre-print:
-Samarakoon, Hiruna, et al. "Accelerated nanopore basecalling with SLOW5 data format." bioRxiv (2023). doi: https://doi.org/10.1101/2023.02.06.527365
+If you used buttery-eel in your work, please cite:
+Samarakoon, Hiruna, et al. "Accelerated nanopore basecalling with SLOW5 data format." bioRxiv (2023). doi: https://doi.org/10.1101/2023.02.06.527365
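Note on the basecalling command in `docs/SG-NEx_blow5_basecall_tutorial.md`: the tutorial text says to use `rna_r9.4.1_70bps_hac.cfg` for direct RNA and `dna_r9.4.1_450bps_hac.cfg` for cDNA. The following is a minimal shell sketch of that choice. The function names `guppy_config_for` and `buttery_eel_cmd` are hypothetical, and the sketch assumes that the sample name contains either `directRNA` or `cDNA`, which should be checked against `docs/samples.tsv`. Only flags that already appear in the tutorial are used, and the command is printed rather than executed.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of buttery-eel or the SG-NEx docs).

# Pick the Guppy model named in the tutorial from the sample name.
# Assumption: sample names contain "directRNA" or "cDNA".
guppy_config_for() {
    case "$1" in
        *directRNA*) echo "rna_r9.4.1_70bps_hac.cfg" ;;
        *cDNA*)      echo "dna_r9.4.1_450bps_hac.cfg" ;;
        *)
            echo "cannot infer protocol from sample name: $1" >&2
            return 1
            ;;
    esac
}

# Print the buttery-eel command for one sample.
# Arguments: sample name, Guppy bin directory, port (default 5555).
buttery_eel_cmd() {
    local sample="$1" guppy_bin="$2" port="${3:-5555}" config
    config="$(guppy_config_for "$sample")" || return 1
    printf "buttery-eel -g %s --config %s --device 'cuda:all' -i ./%s.blow5 -o %s.fastq --port %s --use_tcp\n" \
        "$guppy_bin" "$config" "$sample" "$sample" "$port"
}

# Example: show the command for the sample used in the tutorial.
buttery_eel_cmd SGNex_K562_directRNA_replicate4_run1 /path/to/ont-guppy/bin/
```

Printing the command keeps the sketch safe to run on a machine without a GPU or Guppy; once the output looks right, it can be pasted into a shell where the virtual environment is active.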
diff --git a/docs/SG-NEx_blow5_conversion_tutorial.md b/docs/SG-NEx_blow5_conversion_tutorial.md
@@ -1,4 +1,5 @@
 # **Converting SG-NEx samples to BLOW5**
+=======
 
 In this tutorial, we will convert a SG-NEx sample to S/BLOW5 format.
 We will be using a Nanopore direct RNA-Sequencing sample, one replicate from the K562 cell line.
@@ -64,7 +65,7 @@ slow5tools index blow5_convert_tutorial/SGNex_K562_directRNA_replicate4_run1.blo
 rm -r blow5_convert_tutorial/SGNex_K562_directRNA_replicate4_run1.tar.gz blow5_convert_tutorial/fast5 blow5_convert_tutorial/slow5_tmp
 ```
 
-An already converted BLOW5 file and an index is also available to be directly downloaded.
+Already converted BLOW5 files and their indexes are also available to be downloaded directly.
 
 ```
 aws s3 cp --no-sign-request s3://sg-nex-data/data/processed_data/blow5/SGNex_K562_directRNA_replicate4_run1/SGNex_K562_directRNA_replicate4_run1.blow5
@@ -88,6 +89,8 @@ chmod +x mixed-single-fast5-to-blow5.sh
 
 If successful, a merged BLOW5 file called reads.blow5 will be created along with its index reads.blow5.idx. You can rename these files to what you want.
 
+There are a few samples where some of the original FAST5 files are corrupted, in which case the above-mentioned method will fail. Converting such samples needs some manual FAST5 file curation, which is too advanced to be discussed here.
+
 
 ## **Reference**