Golden-Trail: Retrieving the Data History that Matters from a Comprehensive Provenance Repository (Practice Paper)

Experimental science is not a linear process. As we have noted in our prior work (2010a), publishable results routinely emerge at the end of an extended exploratory process, which unfolds over time and may involve multiple collaborators who often interact only through data sharing facilities. This is particularly apparent in e-science, where experiments are embodied by computational processes that can be executed repeatedly and in many variations, over a large number of input configurations. These processes typically combine well-defined steps encoded as scientific workflows, e.g., in Kepler (2006a) or Taverna (2007a), with custom-made scripts and operations that move data across repositories. Current implementations of e-science infrastructure are designed primarily to support the discovery and creation of valuable data outcomes, while result dissemination has largely been confined to “materials and methods” sections in traditional paper publications. Spurred in part by pressure from funding bodies, which are interested in maximizing their return on investment, the focus of e-science research is now shifting to the later phases of the scientific data lifecycle, namely the sharing and dissemination of scientific results, with the key requirements that the experiment be repeatable and the results be verifiable and reusable (2009a). The notion of Research Objects (RO) has emerged in response to these needs (2010c). These are bundles of logically related artifacts that collectively encompass the history of a scientific outcome and can be used to support its validation and reproduction. They may include the description of the processes used (i.e., workflows), along with the provenance traces obtained by observing workflow execution. Importantly, the view of the experimental process they provide is focused on a selected few datasets that are destined for publication, rather than on the entire “raw” exploration. As a

[1] Yogesh L. Simmhan, et al. The Open Provenance Model core specification (v1.1), 2011, Future Gener. Comput. Syst.

[2] Paolo Missier, et al. Linking multiple workflow provenance traces for interoperable collaborative science, 2010, The 5th Workshop on Workflows in Support of Large-Scale Science.

[3] David R. Newman, et al. Why Linked Data is Not Enough for Scientists, 2010, 2010 IEEE Sixth International Conference on e-Science.

[4] A. Nekrutenko, et al. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences, 2010, Genome Biology.

[5] Peter M. A. Sloot, et al. Understanding Collaborative Studies through Interoperable Workflow Provenance, 2010, IPAW.

[6] Bertram Ludäscher, et al. Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life, 2008, IPAW.

[7] Yogesh L. Simmhan, et al. The Open Provenance Model (v1.01), 2008.

[8] Yolanda Gil, et al. Provenance trails in the Wings/Pegasus system, 2008, Concurr. Comput. Pract. Exp.

[9] Carole A. Goble, et al. Taverna Workflows: Syntax and Semantics, 2007, Third IEEE International Conference on e-Science and Grid Computing (e-Science 2007).

[10] Edward A. Lee, et al. Scientific workflow management and the Kepler system, 2006, Concurr. Comput. Pract. Exp.

[11] Cláudio T. Silva, et al. VisTrails: visualization meets data management, 2006, SIGMOD Conference.

[12] Paul Watson, et al. e-Science Central: Cloud-based e-Science and its application to chemical property modelling, 2010.

[13] A. Coe, et al. Journal Article, 2001.