Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities

With the development of new experimental technologies, biologists are faced with an avalanche of data to be computationally analyzed for scientific advancements and discoveries to emerge. Given the complexity of analysis pipelines, the large number of computational tools, and the enormous amount of data to manage, there is compelling evidence that many if not most scientific discoveries will not stand the test of time: increasing the reproducibility of computed results is of paramount importance. The objective of this paper is to place scientific workflows in the context of reproducibility. To do so, we define several kinds of reproducibility that can be reached when scientific workflows are used to perform experiments. We characterize and define the criteria that reproducibility-friendly scientific workflow systems need to meet, and use these criteria to place several representative and widely used workflow systems and companion tools within such a framework. We also discuss the remaining challenges posed by reproducible scientific workflows in the life sciences. Our study was guided by three use cases from the life science domain involving in silico experiments.
