Provenance-based reproducibility in the Semantic Web

Reproducibility is a crucial property of data, since it allows users to understand and verify how data was derived, and therefore to place their trust in it. Reproducibility is essential for science, because the reproducibility of experimental results is a tenet of the scientific method, but it is also beneficial in many other fields, including automated decision making, visualization, and automated data feeds. To achieve the vision of reproducibility, the workflow-based community has strongly advocated the use of provenance as an underpinning mechanism, since a rich representation of provenance allows steps to be reproduced and all intermediate and final results to be checked and validated. Concurrently, multiple ontology-based representations of provenance have been devised to describe past computations uniformly across a variety of technologies. However, such Semantic Web representations of provenance have no formal link with execution. Even assuming a faithful and non-malicious environment, how can we claim that an ontology-based representation of provenance enables reproducibility when it has not been given any execution semantics, and therefore offers no formal way of expressing the reproduction of computations? This paper tackles this problem by defining a denotational semantics for the Open Provenance Model, referred to as the reproducibility semantics. This semantics is used to implement a reproducibility service that leverages multiple Semantic Web technologies and offers a variety of reproducibility approaches found in the literature. A series of empirical experiments was designed to exhibit the range of reproducibility capabilities of our approach; in particular, we demonstrate the ability to reproduce computations involving multiple technologies, as is commonly found on the Web.
