Provenance for Scientific Workflows Towards Reproducible Research

eScience has established itself as a key pillar of scientific discovery, continuing the evolution of the scientific discovery process from theoretical to empirical to computational science [13]. The extensive deployment of instruments and sensors that observe the physical and biological world is bringing large and diverse datasets within the reach of scientists. This data is increasingly shared, either because of the cost of the instrumentation or because of the desire to address larger-scale and/or cross-disciplinary science such as climate change. There is a tangible push towards building large, global-scale instruments [2][3] and deploying sensors widely [1], with the data they generate shared across large collaborations. Indeed, funding agencies and publishers are starting to insist that scientists share both results and raw datasets, along with the provenance of how each result was produced from the raw dataset(s), to foster open science [4]. Scientific workflows have emerged as the de facto model for researchers to process, transform, and analyze scientific data. These workflows may run on the user's desktop or in the Cloud, and the workflow framework is geared towards easy composition of scientific experiments, allocation and scheduling of resources, orchestration and monitoring of execution, and collection of provenance [20]. The goal of the Trident Scientific Workflow System is to provide a specialized programming environment that simplifies the programming effort scientists need to orchestrate a computational science experiment.

[1] Amit P. Sheth, et al. Janus: From Workflows to Semantic Provenance and Linked Open Data, 2010, IPAW.

[2] Yogesh L. Simmhan, et al. Provenance Information Model of Karma Version 3, 2009, 2009 Congress on Services - I.

[3] Bertram Ludäscher, et al. Scientific workflow design for mere mortals, 2009, Future Gener. Comput. Syst.

[4] David Charles De Roure, et al. myExperiment: social networking for workflow-using e-scientists, 2007, WORKS '07.

[5] Cláudio T. Silva, et al. Examining Statistics of Workflow Evolution Provenance: A First Study, 2008, SSDBM.

[6] Yogesh L. Simmhan, et al. Building Reliable Data Pipelines for Managing Community Data Using Scientific Workflows, 2009, 2009 Fifth IEEE International Conference on e-Science.

[7] Roger S. Barga, et al. Capturing Workflow Event Data for Monitoring, Performance Analysis, and Management of Scientific Workflows, 2008, 2008 IEEE Fourth International Conference on eScience.

[8] Tony Hey, et al. The Fourth Paradigm: Data-Intensive Scientific Discovery, 2009.

[9] Luc Moreau, et al. The Open Provenance Model, 2007.

[10] Yogesh L. Simmhan, et al. The Trident Scientific Workflow Workbench, 2008, 2008 IEEE Fourth International Conference on eScience.

[11] Paul T. Groth, et al. Recording and using provenance in a protein compressibility experiment, 2005, HPDC-14. Proceedings. 14th IEEE International Symposium on High Performance Distributed Computing, 2005.

[12] Yogesh L. Simmhan, et al. Analysis of approaches for supporting the Open Provenance Model: A case study of the Trident workflow workbench, 2011, Future Gener. Comput. Syst.

[13] Elias A. Zerhouni. NIH Public Access Policy, 2004, Science.

[14] E. Keil. The Large Hadron Collider LHC, 1996.

[15] Yogesh L. Simmhan, et al. A survey of data provenance in e-science, 2005, SGMD.

[16] Anthony J. G. Hey, et al. The Fourth Paradigm: Data-Intensive Scientific Discovery [Point of View], 2011.

[17] Fabio Casati, et al. Workflow Evolution, 1996, ER.