BioWorkbench: a high-performance framework for managing and analyzing bioinformatics experiments

Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. The framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both the computational and the scientific domain perspectives, using queries to a provenance and annotation database. Some of these queries are available as pre-built features of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the execution time of the case studies by up to 98%. We also show how applying machine learning techniques can enrich the analysis process.
