Refurbishing Legacy Biological Workflows: SPROUTS Case Study

Scientific discovery relies on an experimental framework that corroborates hypotheses through experiments: complex, reproducible processes that generate and transform large datasets. The methods implicit in these processes capture the semantics of the data and are therefore responsible for generating scientific information and discovering scientific knowledge. Scientific workflows provide the semantics needed to wrap scientific data throughout their capture, analysis, publication, and archival. By annotating data with the processes that produce them, the scientist manages information rather than mere data, which allows meaningful interpretation and integration. Any change to a scientific workflow may significantly affect the quality of the data produced, their semantics, their future analysis, use, integration, and distribution, as well as the performance of the execution. Yet scientific workflows are typically transformed over time: updated with new versions of the tools that compose them, extended with new functionality, and composed. In this paper we discuss the various impacts of workflow transformation and illustrate them with a case study on the Structural Prediction for pRotein fOlding UTility System (SPROUTS) workflow.
