noWorkflow: Capturing and Analyzing Provenance of Scripts

We propose noWorkflow, a tool that transparently captures the provenance of scripts and enables reproducibility. Unlike existing approaches, noWorkflow is non-intrusive and does not require users to change the way they work: they need not wrap their experiments in scientific workflow systems, install version control systems, or instrument their scripts. The tool leverages software engineering techniques, such as abstract syntax tree analysis, reflection, and profiling, to collect different types of provenance, including detailed information about the underlying libraries. We describe how noWorkflow captures multiple kinds of provenance and the classes of analysis it supports: graph-based visualization, differencing over provenance trails, and inference queries.
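To make the techniques named above concrete, the following is a minimal sketch of how abstract syntax tree analysis and profiling can be combined to capture provenance from an unmodified script. It is not noWorkflow's implementation: the names `SCRIPT`, `definition_provenance`, and `execution_provenance` are hypothetical and chosen for illustration, and only Python's standard `ast` and `sys` modules are used.

```python
import ast
import sys

# Hypothetical user script; it contains no provenance instrumentation.
SCRIPT = '''
def square(x):
    return x * x

def total(values):
    return sum(square(v) for v in values)

result = total([1, 2, 3])
'''


def definition_provenance(source):
    """Static pass: map each function definition to the plain names it calls."""
    tree = ast.parse(source)
    calls = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            calls[node.name] = sorted({
                inner.func.id
                for inner in ast.walk(node)
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name)
            })
    return calls


def execution_provenance(source, filename="<script>"):
    """Dynamic pass: record activations of functions defined in the script."""
    defined = set(definition_provenance(source))
    activations = []

    def profiler(frame, event, arg):
        # Keep only Python-level calls to functions the script itself defines.
        code = frame.f_code
        if event == "call" and code.co_filename == filename and code.co_name in defined:
            activations.append(code.co_name)

    namespace = {}
    compiled = compile(source, filename, "exec")
    sys.setprofile(profiler)
    try:
        exec(compiled, namespace)
    finally:
        sys.setprofile(None)  # always detach the profiler
    return activations, namespace


definitions = definition_provenance(SCRIPT)
activations, namespace = execution_provenance(SCRIPT)
```

The static pass corresponds loosely to provenance about what the script defines, and the dynamic pass to provenance about what actually ran. The script itself is left unchanged, which is the sense in which such capture can be non-intrusive.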
