Utilizing Provenance in Reusable Research Objects

Science is conducted collaboratively, which often requires sharing knowledge about computational experiments. When an experiment consists only of datasets, it can be shared using Uniform Resource Identifiers (URIs) or Digital Object Identifiers (DOIs). An experiment, however, seldom consists of datasets alone; more often it also includes software, a record of its past execution, provenance, and associated documentation. The Research Object has recently emerged as a comprehensive and systematic method for aggregating and identifying the diverse elements of computational experiments. Aggregation is necessary, but it is not sufficient for sharing computational experiments: other users must also be able to easily recompute on these shared research objects. Computational provenance is often the key to enabling such reuse. In this paper, we show how reusable research objects can utilize provenance to correctly repeat a previous reference execution, to construct a subset of a research object for partial reuse, and to reuse the existing contents of a research object for modified reuse. We describe two methods for summarizing provenance that aid in understanding the contents and past executions of a research object. The first method obtains a process view by collapsing low-level system information, and the second obtains a summary graph by grouping related nodes and edges, with the goal of obtaining a graph view similar to an application workflow. Through detailed experiments, we show the efficacy and efficiency of our algorithms.
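
To make the idea of a process view concrete, the following is a minimal Python sketch, not the algorithm evaluated in the paper. It assumes a provenance graph given as directed data-flow edges between nodes labeled "process" or "file"; the function name `process_view`, the edge representation, and the example node names are all hypothetical. File nodes are collapsed so that only process-to-process dependencies remain.

```python
from collections import defaultdict


def process_view(edges, kinds):
    """Collapse file nodes so only process-to-process edges remain.

    edges: iterable of (src, dst) pairs; data flows from src to dst.
    kinds: dict mapping each node name to "process" or "file".
    (Both representations are assumptions made for this sketch.)
    """
    writers = defaultdict(set)  # file -> processes that wrote it
    readers = defaultdict(set)  # file -> processes that read it
    view = set()
    for src, dst in edges:
        if kinds[src] == "process" and kinds[dst] == "file":
            writers[dst].add(src)
        elif kinds[src] == "file" and kinds[dst] == "process":
            readers[src].add(dst)
        elif kinds[src] == "process" and kinds[dst] == "process":
            view.add((src, dst))  # direct process edge, kept unchanged
    # A writer of a file feeds every other process that reads that file.
    for file_node, file_writers in writers.items():
        for writer in file_writers:
            for reader in readers.get(file_node, ()):
                if writer != reader:
                    view.add((writer, reader))
    return view


# Hypothetical two-step pipeline: prep cleans raw.csv, model consumes the result.
example_kinds = {
    "prep": "process",
    "model": "process",
    "raw.csv": "file",
    "clean.csv": "file",
    "out.txt": "file",
}
example_edges = [
    ("raw.csv", "prep"),
    ("prep", "clean.csv"),
    ("clean.csv", "model"),
    ("model", "out.txt"),
]
example_view = process_view(example_edges, example_kinds)
```

In this example the five-node graph reduces to a single dependency from `prep` to `model`. Input files that no process wrote and output files that no process read disappear from the view, which is the kind of detail a process-level summary is meant to hide.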
