Abstract Provenance Graphs: Anticipating and Exploiting Schema-Level Data Provenance

Provenance graphs capture flow and dependency information recorded during scientific workflow runs, which can subsequently be used to interpret, validate, and debug workflow results. In this paper, we propose the new concept of Abstract Provenance Graphs (APGs). APGs are created via static analysis of a configured workflow W and its input data schema, i.e., before W is actually executed. They summarize all possible provenance graphs that W can create with input data of type τ; that is, for each input v ∈ τ there exists a graph homomorphism \(\mathcal H_v\) from the concrete provenance graph to the abstract provenance graph. APGs are helpful during workflow construction since they (1) make certain workflow design bugs (e.g., selecting no input data or the wrong input data for the actors) easy to spot, and (2) show the evolution of the overall data organization of a workflow. Moreover, after workflows have been run, APGs can be used to validate concrete provenance graphs. A more detailed version of this work is available as [14].
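
To illustrate the validation use case, the following is a minimal sketch of checking whether a given node mapping is a graph homomorphism from a concrete provenance graph to an abstract one. It is not the construction used in this work: the edge-list representation, the function name `is_homomorphism`, the node names, and the convention that an edge points from a derived item to the item it depends on are all assumptions made for illustration.

```python
from typing import Dict, Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]


def is_homomorphism(
    concrete_edges: Iterable[Edge],
    abstract_edges: Iterable[Edge],
    mapping: Dict[Hashable, Hashable],
) -> bool:
    """Return True if `mapping` sends every concrete edge to an abstract edge.

    Edges are (derived, source) pairs (an assumed convention).
    """
    abstract = set(abstract_edges)
    for src, dst in concrete_edges:
        # Every concrete node touched by an edge must have an abstract image.
        if src not in mapping or dst not in mapping:
            return False
        # The image of a concrete edge must be an edge of the abstract graph.
        if (mapping[src], mapping[dst]) not in abstract:
            return False
    return True


# Hypothetical abstract graph: alignments depend on sequences, trees on alignments.
abstract_edges = [("Alignment", "Sequence"), ("Tree", "Alignment")]

# Hypothetical concrete run with two input sequences.
concrete_edges = [("align1", "seq1"), ("align1", "seq2"), ("tree1", "align1")]
mapping = {
    "seq1": "Sequence",
    "seq2": "Sequence",
    "align1": "Alignment",
    "tree1": "Tree",
}

valid = is_homomorphism(concrete_edges, abstract_edges, mapping)

# A dependency the abstract graph does not allow (a tree derived directly from a sequence).
invalid = is_homomorphism(concrete_edges + [("tree1", "seq1")], abstract_edges, mapping)
```

In this sketch, `valid` is `True` because every concrete dependency maps onto an abstract one, while `invalid` is `False` because the extra edge has no counterpart in the abstract graph. A concrete run that fails such a check would be flagged as inconsistent with the statically derived summary.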

[1] Sanjeev Khanna, et al. Optimizing user views for workflows, 2009, ICDT '09.

[2] James Cheney, et al. FLUX: functional updates for XML, 2008, ICFP.

[3] Bertram Ludäscher, et al. XML-based computation for scientific workflows, 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[4] Carole A. Goble, et al. The Data Playground: An Intuitive Workflow Specification Environment, 2007, Third IEEE International Conference on e-Science and Grid Computing (e-Science 2007).

[5] Bertram Ludäscher, et al. Exploring Scientific Workflow Provenance Using Hybrid Queries over Nested Data and Lineage Graphs, 2009, SSDBM.

[6] Benjamin C. Pierce, et al. Regular expression types for XML, 2005, ACM Trans. Program. Lang. Syst.

[7] Derick Wood, et al. One-Unambiguous Regular Languages, 1998, Inf. Comput.

[8] Luc Moreau, et al. The Open Provenance Model: An Overview, 2008, IPAW.

[9] James Cheney, et al. Lux: A Lightweight, Statically Typed XML Update Language, 2007, PLAN-X.

[10] Ian J. Taylor, et al. The Triana Workflow Environment: Architecture and Applications, 2007, Workflows for e-Science, Scientific Workflows for Grids.

[11] Bertram Ludäscher, et al. Scientific workflow design for mere mortals, 2009, Future Gener. Comput. Syst.

[12] James Cheney. Provenance, XML, and the Scientific Web, 2009.

[13] Susan B. Davidson, et al. Zoom*UserViews: Querying Relevant Provenance in Workflow Systems, 2007, VLDB.

[14] Bertram Ludäscher, et al. Scientific workflow design with data assembly lines, 2009, WORKS '09.

[15] Tony Hey, et al. The Fourth Paradigm: Data-Intensive Scientific Discovery, 2009.

[16] James Cheney, et al. A Graph Model of Data and Workflow Provenance, 2010, TaPP.

[17] Bertram Ludäscher, et al. A navigation model for exploring scientific workflow provenance graphs, 2009, WORKS '09.

[18] Giuseppe Castagna, et al. CDuce: an XML-centric general-purpose language, 2003, ACM SIGPLAN Notices.

[19] Yogesh L. Simmhan, et al. The Open Provenance Model core specification (v1.1), 2011, Future Gener. Comput. Syst.