Using Domain-Specific Data to Enhance Scientific Workflow Steering Queries

In scientific workflows, provenance data helps scientists understand, evaluate, and reproduce their results. Provenance data generated at runtime can also support workflow steering mechanisms. Providing steering facilities for workflows is considered a challenge because of the dynamic demands that arise during execution. To steer, for example, scientists should be able to suspend (or stop) a workflow execution when the approximate solution meets (or deviates from) preset criteria. These criteria are commonly evaluated using both provenance data (execution data) and domain-specific data. We claim that the final decision on whether to interfere with the workflow execution may only become feasible when scientists can steer workflows using provenance data enriched with domain-specific data. In this paper we propose an approach based on specialized software components, named Data Extractors (DE), that acquire domain-specific data from the data files produced during a scientific workflow execution. A DE gathers domain-specific data from the produced data files and associates it with the existing provenance data in the provenance repository. We have evaluated the proposed approach using a real bioinformatics workflow for comparative genomics executed in the SciCumulus parallel cloud workflow engine.
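The idea of a DE can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the relational schema, the `name=value` file format, the `e_value` field, and all function names are assumptions made for illustration, and an in-memory SQLite database stands in for the provenance repository. The sketch parses a produced data file, stores the extracted values next to the provenance record of the activation that produced the file, and then runs a steering query that joins both kinds of data.

```python
import sqlite3
import tempfile
from pathlib import Path


def create_repository() -> sqlite3.Connection:
    """Create an in-memory stand-in for a provenance repository (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE activation (
            id          INTEGER PRIMARY KEY,
            activity    TEXT NOT NULL,
            status      TEXT NOT NULL,
            output_file TEXT NOT NULL
        );
        CREATE TABLE domain_data (
            activation_id INTEGER NOT NULL REFERENCES activation(id),
            name          TEXT NOT NULL,
            value         REAL NOT NULL
        );
        """
    )
    return conn


def extract_domain_data(path: Path) -> dict:
    """Parse numeric 'name=value' lines from a produced data file (assumed format)."""
    values = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, raw = line.partition("=")
        try:
            values[name.strip()] = float(raw)
        except ValueError:
            continue  # this sketch keeps numeric fields only
    return values


def register_activation(conn, activity: str, status: str, output_file: Path) -> int:
    """Store the provenance record, then attach the extracted domain data to it."""
    cur = conn.execute(
        "INSERT INTO activation (activity, status, output_file) VALUES (?, ?, ?)",
        (activity, status, str(output_file)),
    )
    activation_id = cur.lastrowid
    for name, value in extract_domain_data(output_file).items():
        conn.execute(
            "INSERT INTO domain_data (activation_id, name, value) VALUES (?, ?, ?)",
            (activation_id, name, value),
        )
    conn.commit()
    return activation_id


def activations_violating(conn, name: str, threshold: float) -> list:
    """Steering query: activations whose extracted value exceeds a preset threshold."""
    return conn.execute(
        """
        SELECT a.id, a.activity, d.value
        FROM activation AS a
        JOIN domain_data AS d ON d.activation_id = a.id
        WHERE d.name = ? AND d.value > ?
        ORDER BY a.id
        """,
        (name, threshold),
    ).fetchall()


if __name__ == "__main__":
    conn = create_repository()
    with tempfile.TemporaryDirectory() as workdir:
        result = Path(workdir) / "alignment.out"  # hypothetical output file
        result.write_text("# hypothetical result file\ne_value=0.2\n")
        register_activation(conn, "alignment", "FINISHED", result)
        for row in activations_violating(conn, "e_value", 0.05):
            print("candidate for steering:", row)
```

A scientist monitoring the run could issue a query such as `activations_violating(conn, "e_value", 0.05)` while the workflow is still executing and use the answer to decide whether to suspend or stop it. Keeping the extracted values in a separate table keyed by activation leaves the provenance records themselves unchanged.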
