An Inference-Based Framework to Manage Data Provenance in Geoscience Applications

Data provenance allows scientists to validate their models and to investigate the origin of an unexpected value. It can also serve as a replication recipe for output data products. However, capturing provenance demands considerable time and training from scientists. First, they need to design the workflow of the scientific model, i.e., workflow provenance. In practice, scientists may not document any workflow provenance before model execution because they lack the time and training to do so. Second, they need to capture provenance while the model is running, i.e., fine-grained data provenance. Explicit documentation of fine-grained provenance is not feasible because provenance data consume massive amounts of storage in applications where data arrive and are processed continuously, including those in the geoscience domain. In this paper, we propose an inference-based framework that provides both workflow and fine-grained data provenance at minimal cost in terms of time, training, and disk consumption. The proposed framework is applicable to any given scientific model and can handle different model dynamics, such as variation in processing time and in the arrival pattern of input data products. Our evaluation of the framework in a real use case with geospatial data shows that it is relevant and suitable for scientists in the geoscientific domain.
