Interactive provenance summaries for reproducible science

Recorded provenance facilitates reproducible science. Provenance metadata can help determine how data were transformed, processed, and derived from original sources. While provenance is crucial for verification and validation, an open issue is the granularity, that is, the level of detail, at which provenance data must be presented to a user, especially when conducting reproducible science. When data are reproduced successfully, the need for detailed provenance is minimal and a summary of the recorded provenance suffices. However, when data are not reproduced correctly, users want to quickly drill down into fine-grained provenance to understand the causes of failure. In this paper, we describe a drill-up/drill-down method for exploring provenance traces. The drill-up method summarizes the trace by grouping nodes and edges that have the same derivation histories, and it preserves provenance data flow semantics. The drill-down method compares summary groups and ranks the groups that may contain information about the errors. Both methods are implemented efficiently using light-weight data structures so as to be suitable for reproducible science. We conduct a thorough experimental analysis to show how the operators perform in compressing and expanding real provenance graphs.
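To make the drill-up idea concrete, the following is a minimal Python sketch of grouping trace nodes by derivation history. It is an illustration under stated assumptions, not the method of the paper: it assumes the trace is an acyclic graph, that each node carries a label, and that two nodes share a derivation history when they have the same label and their parents have matching histories. The names `summarize`, `labels`, and `parents` are hypothetical.

```python
from collections import defaultdict


def summarize(labels, parents):
    """Group nodes whose derivation histories look identical.

    labels:  dict mapping node -> label (hypothetical, e.g. "file", "process")
    parents: dict mapping node -> list of nodes it was directly derived from
    Assumes the trace is acyclic; very deep traces may hit the recursion limit.
    """
    signature = {}  # node -> small integer id of its derivation history
    ids = {}        # (label, sorted parent ids) -> integer id

    def sig(node):
        if node not in signature:
            parent_ids = tuple(sorted(sig(p) for p in parents.get(node, [])))
            key = (labels[node], parent_ids)
            # Interning keys as integers keeps signatures compact.
            signature[node] = ids.setdefault(key, len(ids))
        return signature[node]

    groups = defaultdict(list)
    for node in labels:
        groups[sig(node)].append(node)
    # Sorted output so the result is deterministic.
    return sorted(sorted(members) for members in groups.values())


# Two parallel runs of the same step collapse into three summary groups.
labels = {"in1": "file", "in2": "file",
          "p1": "process", "p2": "process",
          "out1": "file", "out2": "file"}
parents = {"p1": ["in1"], "p2": ["in2"], "out1": ["p1"], "out2": ["p2"]}
print(summarize(labels, parents))
```

In this sketch, a node whose history differs anywhere upstream (for example, an output with an extra input) lands in its own group, which is the kind of difference a drill-down step could then surface.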
