Connecting Scientific Data to Scientific Experiments with Provenance

As scientific workflows and the data they operate on, grow in size and complexity, the task of defining how those workflows should execute (which resources to use, where the resources must be in readiness for processing etc.) becomes proportionally more difficult. While "workflow compilers", such as Pegasus, reduce this burden, a further problem arises: since specifying details of execution is now automatic, a workflow's results are harder to interpret, as they are partly due to specifics of execution. By automating steps between the experiment design and its results, we lose the connection between them, hindering interpretation of results. To reconnect the scientific data with the original experiment, we argue that scientists should have access to the full provenance of their data, including not only parameters, inputs and intermediary data, but also the abstract experiment, refined into a concrete execution by the "workflow compiler". In this paper, we describe preliminary work on adapting Pegasus to capture the process of workflow refinement in the PASOA provenance system.

[1]  Juliana Freire,et al.  Tackling the Provenance Challenge one layer at a time , 2008, Concurr. Comput. Pract. Exp..

[2]  Paul T. Groth,et al.  PReServ: Provenance Recording for Services , 2005 .

[3]  Ian Foster,et al.  The First Provenance Challenge , 2008 .

[4]  Eric Gilbert,et al.  Virtual data Grid middleware services for data‐intensive science , 2006, Concurr. Comput. Pract. Exp..

[5]  Yogesh L. Simmhan,et al.  Performance Evaluation of the Karma Provenance Framework for Scientific Workflows , 2006, IPAW.

[6]  Yolanda Gil,et al.  Wings for Pegasus: A Semantic Approach to Creating Very Large Scientific Workflows , 2006, OWLED.

[7]  Carole A. Goble,et al.  Mining Taverna's semantic web of provenance , 2008, Concurr. Comput. Pract. Exp..

[8]  James Frew,et al.  Lineage retrieval for scientific data processing: a survey , 2005, CSUR.

[9]  Dennis Gannon,et al.  Query capabilities of the Karma provenance framework , 2008 .

[10]  Daniel S. Katz,et al.  Pegasus: A framework for mapping complex scientific workflows onto distributed systems , 2005, Sci. Program..

[11]  Cláudio T. Silva,et al.  Managing Rapidly-Evolving Scientific Workflows , 2006, IPAW.

[12]  Ian T. Foster,et al.  Condor-G: A Computation Management Agent for Multi-Institutional Grids , 2004, Cluster Computing.

[13]  Yogesh L. Simmhan,et al.  A survey of data provenance in e-science , 2005, SGMD.

[14]  Robert Stevens,et al.  Annotating, Linking and Browsing Provenance Logs for {e-Science} , 2003 .

[15]  Luc Moreau,et al.  Provenance and Annotation of Data, International Provenance and Annotation Workshop, IPAW 2006, Chicago, IL, USA, May 3-5, 2006, Revised Selected Papers , 2006, IPAW.

[16]  Paul T. Groth,et al.  An Architecture for Provenance Systems , 2006 .

[17]  Paul T. Groth,et al.  The Requirements of Using Provenance in e-Science Experiments , 2007, Journal of Grid Computing.

[18]  Yogesh L. Simmhan,et al.  Query capabilities of the Karma provenance framework , 2008, Concurr. Comput. Pract. Exp..

[19]  Daniel S. Katz,et al.  Montage: a grid-enabled engine for delivering custom science-grade mosaics on demand , 2004, SPIE Astronomical Telescopes + Instrumentation.

[20]  Edward A. Lee,et al.  Scientific workflow management and the Kepler system , 2006, Concurr. Comput. Pract. Exp..

[21]  Ewa Deelman,et al.  Pegasus: Mapping Large-Scale Workflows to Distributed Resources , 2007, Workflows for e-Science, Scientific Workflows for Grids.

[22]  Carole A. Goble,et al.  Using Semantic Web Technologies for Representing E-science Provenance , 2004, SEMWEB.