Engineering Choices for Open World Provenance

This work outlines the engineering decisions required to support a provenance system in an open world, where systems are under no common control and use many different technologies. Real U.S. government applications have shown us the need for specialized identity techniques, flexible storage, scalability testing, protection of sensitive information, and customizable provenance queries. We analyze the tradeoffs among approaches in each area, prioritizing graph connectivity and breadth of capture over the fine-grained capture emphasized in other work. We implement each technique in the PLUS system, test its real-time efficiency, and describe the results.
