The requirements of recording and using provenance in e-Science experiments

In e-Science experiments, it is vital to record the experimental process for later use, such as interpreting results, verifying that the correct process took place, or tracing where data came from. The process that led to some data is called the provenance of that data, and a provenance architecture is the software architecture for a system that provides the functionality needed to record, store and use process documentation to determine the provenance of data items. However, there has been little principled analysis of what is actually required of a provenance architecture, so it is impossible to determine the functionality that such an architecture would ideally support. In this paper, we present use cases for a provenance architecture drawn from current experiments in biology, chemistry, physics and computer science, and analyse these use cases to determine the technical requirements of a generic, application-independent architecture. We propose an architecture that meets these requirements and evaluate a preliminary implementation by attempting to realise two of the use cases.
