Querying and managing OPM-compliant scientific workflow provenance

Provenance, the metadata that records the derivation history of scientific results, is important in scientific workflows for interpreting, validating, and analyzing the results of scientific computing. To promote interoperability among heterogeneous provenance systems, the Open Provenance Model (OPM) has been proposed and has played an important role in the community. In this dissertation, to efficiently query and manage OPM-compliant provenance, we first propose a provenance collection framework that collects both prospective provenance, which captures an abstract workflow specification as a recipe for future data derivation, and retrospective provenance, which captures past workflow execution and data derivation information. We then propose a relational database-based provenance system, called OPMPROV, that stores, reasons over, and queries both prospective and retrospective OPM-compliant provenance. Finally, we propose OPQL, an OPM-level provenance query language that is defined directly over the OPM model. An OPQL query takes an OPM graph as input and produces an OPM graph as output; therefore, OPQL queries are not tightly coupled to the underlying provenance storage strategy. Our provenance store, provenance collection framework, and provenance query language all provide native support for the OPM model.
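
The graph-in, graph-out property described above can be illustrated with a minimal Python sketch. This is not OPQL syntax and not the OPMPROV implementation; the names `OPMGraph`, `Edge`, and `lineage` are hypothetical, and only two OPM dependency kinds (`used`, `wasGeneratedBy`) appear in the example data. The point is only that a query operator whose result is again an OPM graph can be composed with further queries, independently of how the graph is stored.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    """A causal dependency: `effect` depends on `cause` (hypothetical layout)."""
    kind: str    # e.g. "used" or "wasGeneratedBy"
    effect: str  # node that depends on the cause
    cause: str   # node it depends on


@dataclass
class OPMGraph:
    """An illustrative in-memory OPM graph; not tied to any storage backend."""
    artifacts: set = field(default_factory=set)
    processes: set = field(default_factory=set)
    edges: set = field(default_factory=set)


def lineage(graph: OPMGraph, node: str) -> OPMGraph:
    """Return everything `node` causally depends on, itself as an OPMGraph."""
    reached = {node}
    frontier = [node]
    kept = set()
    while frontier:
        current = frontier.pop()
        for e in graph.edges:
            if e.effect == current:
                kept.add(e)
                if e.cause not in reached:
                    reached.add(e.cause)
                    frontier.append(e.cause)
    return OPMGraph(
        artifacts=graph.artifacts & reached,
        processes=graph.processes & reached,
        edges=kept,
    )


# Made-up example data: raw -> filter -> clean -> render -> plot,
# plus an unrelated process and artifact.
g = OPMGraph(
    artifacts={"raw", "clean", "plot", "other"},
    processes={"filter", "render", "unrelated"},
    edges={
        Edge("used", "filter", "raw"),
        Edge("wasGeneratedBy", "clean", "filter"),
        Edge("used", "render", "clean"),
        Edge("wasGeneratedBy", "plot", "render"),
        Edge("wasGeneratedBy", "other", "unrelated"),
    },
)

plot_lineage = lineage(g, "plot")
# Because the result is an OPMGraph, it can be queried again.
clean_lineage = lineage(plot_lineage, "clean")
```

Because `lineage` never touches a database, the same call would work whether the input graph was loaded from relational tables, RDF, or XML, which is the decoupling the abstract attributes to OPQL.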
