Next generation sequencing (NGS) generates massive amounts of data, and its analysis requires access to powerful computational infrastructure, high-quality bioinformatics software, and personnel skilled in operating the tools. We present a practical solution to this data management and analysis challenge using the Globus Genomics platform, which simplifies large-scale data handling and provides advanced tools for NGS data analysis. Globus Genomics uses Globus Transfer to allow seamless data transfer from distributed data endpoints, such as sequencing centers, into its analytical platform for immediate analysis. It hosts an enhanced Galaxy instance on Amazon Web Services (AWS) resources with hundreds of widely used NGS analysis tools and many pre-defined best-practices pipelines for whole genome/exome, RNA-seq, or ChIP-seq data analysis. Unlimited scalability and simultaneous analysis of numerous data sets are possible due to the platform's ability to provision on-demand compute clusters.
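The data-movement step described above can be illustrated with a minimal sketch. This is not the actual Globus Genomics code: the function and endpoint names below are hypothetical placeholders that merely mirror the shape of a Globus Transfer task (a source endpoint, a destination endpoint, and a list of file items); a real submission would go through the Globus Transfer service via its SDK.

```python
# Illustrative sketch only: assembles a Globus-style transfer request as a plain
# dict. Endpoint names and paths are hypothetical; a real transfer would be
# submitted through the Globus Transfer API with authenticated endpoint UUIDs.

def build_transfer_task(source_endpoint, dest_endpoint, items, label=""):
    """Assemble a transfer task document listing files to move between endpoints."""
    return {
        "DATA_TYPE": "transfer",
        "source_endpoint": source_endpoint,
        "destination_endpoint": dest_endpoint,
        "label": label,
        "DATA": [
            {"DATA_TYPE": "transfer_item",
             "source_path": src,
             "destination_path": dst}
            for src, dst in items
        ],
    }

# Hypothetical example: stage raw reads from a sequencing center into the
# analysis platform's inbox.
task = build_transfer_task(
    "sequencing-center-endpoint",
    "globus-genomics-endpoint",
    [("/runs/sample01.fastq.gz", "/inbox/sample01.fastq.gz")],
    label="NGS raw data ingest",
)
```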
Proceedings of the Conference on Extreme Science and Engineering Discovery Environment Gateway to Discovery - XSEDE '13, 2013
We describe Globus Genomics, a system that we have developed for rapid analysis of large quantities of next-generation sequencing (NGS) genomic data. This system is notable for its high degree of end-to-end automation, which encompasses every stage of the data analysis pipeline from initial data access (from remote sequencing center or database, by the Globus Online file transfer system) to on-demand resource acquisition (on Amazon EC2, via the Globus Provision cloud manager); specification, configuration, and reuse of multi-step processing pipelines (via the Galaxy workflow system); and efficient scheduling of these pipelines over many processors (via the Condor scheduler). The system allows biomedical researchers to perform rapid analysis of large NGS datasets using just a web browser in a fully automated manner, without software installation.
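The end-to-end flow in this abstract (data access, resource acquisition, then pipeline execution) can be sketched as an ordered sequence of stages. The stage names and payloads below are invented for illustration; in the actual system each step is delegated to Globus Online, Globus Provision, Galaxy, and Condor respectively.

```python
# Toy sketch of the end-to-end automation described above: a dataset is threaded
# through an ordered list of stages, each of which enriches it. All names here
# are hypothetical stand-ins for the real subsystems.

def run_pipeline(dataset, stages):
    """Apply each (name, fn) stage in order, recording which stages ran."""
    log = []
    for name, fn in stages:
        dataset = fn(dataset)
        log.append(name)
    return dataset, log

stages = [
    ("transfer",  lambda d: {**d, "staged": True}),           # data access (Globus Online)
    ("provision", lambda d: {**d, "nodes": 4}),               # EC2 nodes (Globus Provision)
    ("analyze",   lambda d: {**d, "variants": "calls.vcf"}),  # Galaxy workflow execution
]

result, log = run_pipeline({"sample": "NA12878"}, stages)
```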
2012 IEEE 8th International Conference on E-Science, 2012
In the past few years in the biomedical field, the availability of low-cost sequencing methods in the form of next-generation sequencing has revolutionized the approaches life science researchers are undertaking in order to gain a better understanding of the causative factors of diseases. With biomedical researchers getting many of their patients' DNA and RNA sequenced, sequencing centers are working with hundreds of researchers, with terabytes to petabytes of data for each researcher. The unprecedented scale at which genomic sequence data is generated today by high-throughput technologies requires sophisticated and high-performance methods of data handling and management. For the most part, however, the state of the art is to ship the data on hard disks. As data volumes reach tens or even hundreds of terabytes, such approaches become increasingly impractical. Data stored on portable media can be easily lost, and typically is not readily accessible to all members of the collaboration. In this paper, we discuss the application of Globus Online within a sequencing facility to address the data movement and management challenges that arise as a result of the exponentially increasing amount of data being generated by a rapidly growing number of research groups. We also present the unique challenges in applying a Globus Online solution in sequencing center environments and how we overcome those challenges.
Computational and Structural Biotechnology Journal, 2015
Next generation sequencing (NGS) technologies produce massive amounts of data requiring a powerful computational infrastructure, high-quality bioinformatics software, and skilled personnel to operate the tools. We present a case study of a practical solution to this data management and analysis challenge that simplifies terabyte-scale data handling and provides advanced tools for NGS data analysis. These capabilities are implemented using the "Globus Genomics" system, an enhanced Galaxy workflow system made available as a service that offers users the capability to process and transfer data easily, reliably, and quickly to address end-to-end NGS analysis requirements. The Globus Genomics system is built on Amazon's cloud computing infrastructure. The system takes advantage of elastic scaling of compute resources to run multiple workflows in parallel, and it also helps meet the scale-out analysis needs of modern translational genomics research.
An essential step in the discovery of molecular mechanisms contributing to disease phenotypes, and in efficient experimental planning, is the development of weighted hypotheses that estimate the functional effects of sequence variants discovered by high-throughput genomics. With the increasing specialization of bioinformatics resources, creating analytical workflows that seamlessly integrate data and bioinformatics tools developed by multiple groups becomes inevitable. Here we present a case study of the use of a distributed analytical environment integrating four complementary specialized resources, namely the Lynx platform, VISTA RViewer, the Developmental Brain Disorders Database (DBDB), and the RaptorX server, for the identification of high-confidence candidate genes contributing to the pathogenesis of spina bifida. The analysis resulted in prediction and validation of deleterious mutations in the SLC19A placental transporter in mothers of the affected children that causes narrowing ...
Proceedings of the 8th Workshop on Workflows in Support of Large-Scale Science - WORKS '13, 2013
Workflow systems play an important role in the analysis of the fast-growing genomics data produced by low-cost next generation sequencing (NGS) technologies. Many biomedical research groups lack the expertise to assemble and run the sophisticated computational pipelines required for high-throughput analysis of such data. There is an urgent need for services that allow researchers to run their analytical workflows and define their own research methodologies by selecting the tools of their interest. We present the challenges associated with managing multiple Galaxy instances on the cloud for various research groups using Globus Genomics, a cloud-based platform-as-a-service (PaaS) that provides the Galaxy workflow system as a hosted service along with data management capabilities using Globus Online. We address the unique challenges, our strategy, and a tool for automatically deploying and managing hundreds of analytical tools coming from the public Galaxy Tool Shed, new tools wrapped by our group, and tools wrapped by end users across multiple Galaxy instances hosted with Globus Genomics.
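The tool-management problem this abstract raises, keeping hundreds of tools consistent across many hosted Galaxy instances, can be reduced to a set-difference computation. The sketch below is a hypothetical simplification with invented instance and tool names; the actual deployment tool described in the paper works against real Galaxy Tool Shed repositories.

```python
# Hypothetical sketch: given a desired tool set and what each hosted Galaxy
# instance currently has installed, compute what must be deployed where.
# Instance names and tool names are invented for illustration.

def plan_deployments(desired_tools, instances):
    """Return {instance: sorted list of missing tools} for instances needing updates."""
    plan = {}
    for name, installed in instances.items():
        missing = sorted(set(desired_tools) - set(installed))
        if missing:
            plan[name] = missing
    return plan

desired = {"bwa", "samtools", "gatk", "fastqc"}
instances = {
    "lab-a": {"bwa", "samtools"},                      # out of date
    "lab-b": {"bwa", "samtools", "gatk", "fastqc"},    # fully provisioned
}
plan = plan_deployments(desired, instances)
```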
Due to the coming deluge of genome data, the needs for storing and processing large-scale genome data, easy access to biomedical analysis tools, and efficient data sharing and retrieval present significant challenges. The variability in data volume results in variable computing and storage requirements; therefore, biomedical researchers are pursuing more reliable, dynamic, and convenient methods for conducting sequencing analyses. This paper proposes a Cloud-based bioinformatics workflow platform for large-scale next-generation sequencing analyses, which enables reliable and highly scalable execution of sequencing analysis workflows in a fully automated manner. Our platform extends the existing Galaxy workflow system by adding data management capabilities for transferring large quantities of data efficiently and reliably (via Globus Transfer), domain-specific analysis tools preconfigured for immediate use by researchers (via user-specific tools integration), automatic deployment on the Cloud for on-demand resource allocation and pay-as-you-go pricing (via Globus Provision), a Cloud provisioning tool for auto-scaling (via the HTCondor scheduler), and support for validating the correctness of workflows (via semantic verification tools). Two bioinformatics workflow use cases as well as a performance evaluation are presented to validate the feasibility of the proposed approach.
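The auto-scaling idea mentioned above (provisioning compute nodes on demand as analysis load grows) can be sketched as a simple sizing rule. The thresholds and parameter names below are invented for illustration; the actual platform drives provisioning from HTCondor pool status rather than a formula like this.

```python
# Toy sketch of an auto-scaling decision: choose a worker-node count from the
# queue depth, bounded by a budget cap. All numbers here are hypothetical.
import math

def workers_needed(queued_jobs, jobs_per_node=4, max_nodes=16):
    """Scale nodes with queue depth, never exceeding the configured cap."""
    if queued_jobs <= 0:
        return 0  # idle pool: release all on-demand nodes
    return min(max_nodes, math.ceil(queued_jobs / jobs_per_node))
```

Pay-as-you-go pricing makes the cap (`max_nodes`) the key knob: it trades analysis turnaround time against cloud spend.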
Papers by Paul Dave