The Requirements of Using Provenance in e-Science Experiments

In e-Science experiments, it is vital to record the experimental process for later use, such as interpreting results, verifying that the correct process took place, or tracing where data came from. The process that led to some data is called the provenance of that data, and a provenance architecture is the software architecture for a system that provides the functionality needed to record, store and use process documentation to determine the provenance of data items. However, there has been little principled analysis of what is actually required of a provenance architecture, so it is impossible to determine the functionality such architectures would ideally support. In this paper, we present use cases for a provenance architecture from current experiments in biology, chemistry, physics and computer science, and analyse the use cases to determine the technical requirements of a generic, technology- and application-independent architecture. We propose an architecture that meets these requirements, compare its features with those of other approaches, and evaluate a preliminary implementation by attempting to realise two of the use cases.
