Provenance and Annotation of Data and Processes

Many resource-intensive analytics processes evolve over time as new versions of the reference datasets and software dependencies they use are released. We focus on scenarios in which any version change has the potential to affect many outcomes, as is the case in high-throughput genomics, where the same process is used to analyse large cohorts of patient genomes (cases). Because a version change is unlikely to affect the entire population, an efficient strategy for restoring the currency of the outcomes must first identify the scope of the change, i.e., the subset of affected data products. In this paper we describe a generic and reusable provenance-based approach to this scope discovery problem. It applies to scenarios where the process consists of complex hierarchical components, where different input cases are processed using different version configurations of each component, and where separate provenance traces are collected for the executions of each component. We show how a new data structure, called a restart tree, is computed and exploited to address the change scope discovery problem.
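To make the scope discovery problem concrete, the following is a minimal sketch in Python. It is not the restart tree described above, only a flat filter over per-component provenance records. The `Execution` record, the `change_scope` function, and all case, component, and dependency names are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Execution:
    """One provenance record: a component run for one input case (hypothetical schema)."""
    case_id: str
    component: str
    used: frozenset  # set of (dependency name, version) pairs used by this run


def change_scope(executions, dependency, old_version):
    """Map each affected case to the components that used `dependency` at `old_version`."""
    scope = {}
    for e in executions:
        if (dependency, old_version) in e.used:
            scope.setdefault(e.case_id, set()).add(e.component)
    return scope


# Illustrative records: different cases were processed with different versions.
executions = [
    Execution("case-1", "alignment", frozenset({("refgenome", "v37")})),
    Execution("case-1", "annotation", frozenset({("vardb", "2016-01")})),
    Execution("case-2", "alignment", frozenset({("refgenome", "v38")})),
    Execution("case-2", "annotation", frozenset({("vardb", "2016-01")})),
    Execution("case-3", "annotation", frozenset({("vardb", "2016-05")})),
]

# A new release of "vardb" supersedes version "2016-01"; find who is affected.
scope = change_scope(executions, "vardb", "2016-01")
```

In this sketch only the cases whose annotation step used the superseded version fall within the scope, so `case-3` and every alignment run can be left untouched. Handling hierarchical components and deciding where to restart each case is what the restart tree is meant to address, and that is outside this example.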
