Dynamic Pipeline Changes in Scientific Data Processing

Understanding how data objects differ is a major problem, especially in scientific collaborations where scientists collectively reuse data and modify and adapt scripts developed by their peers to process it, while publishing the results to a centralized data store. Although data provenance has been studied extensively as a way to trace the origins of a data item, it does not address changes made to the source code. Systems often consist of a large number of modules, each containing hundreds of lines of code, and it is generally not obvious which parts of the source code contributed to a change in a data object. The paper introduces the Class-Based Object Versioning framework, which overcomes some of the shortcomings of popular versioning systems (e.g. CVS, SVN) in maintaining data and code provenance information in scientific computing environments. The framework automatically identifies and captures useful fine-grained changes in the data and code of scripts that perform scientific experiments, so that important information about intermediate stages (i.e. unrecorded changes in experiment parameters and procedures) can be identified and analyzed.
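
As a rough illustration of what class-level change detection could look like, the sketch below fingerprints each top-level class in a processing script and reports which classes differ between two versions. It is not the framework's actual mechanism, which the abstract does not describe: the function names `class_fingerprints` and `changed_classes`, the sample `Calibrate` and `Stack` classes, and the choice of hashing the syntax tree are all assumptions made for this example.

```python
import ast
import hashlib


def class_fingerprints(source: str) -> dict[str, str]:
    """Map each top-level class name to a hash of its syntax tree (hypothetical helper)."""
    fingerprints = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            # ast.dump omits line numbers by default, so comments and
            # whitespace changes do not alter the fingerprint.
            dumped = ast.dump(node)
            fingerprints[node.name] = hashlib.sha256(dumped.encode("utf-8")).hexdigest()
    return fingerprints


def changed_classes(old_source: str, new_source: str) -> list[str]:
    """Return names of classes that were added, removed, or modified (hypothetical helper)."""
    old = class_fingerprints(old_source)
    new = class_fingerprints(new_source)
    return sorted(name for name in old.keys() | new.keys() if old.get(name) != new.get(name))


# Two versions of an illustrative processing script (assumed example data).
OLD = """
class Calibrate:
    gain = 1.0
    def run(self, x):
        return x * self.gain

class Stack:
    def run(self, frames):
        return sum(frames)
"""

NEW = """
class Calibrate:
    gain = 1.2
    def run(self, x):
        return x * self.gain

class Stack:
    # comment-only edit, no change in behaviour
    def run(self, frames):
        return sum(frames)
"""

print(changed_classes(OLD, NEW))
```

Here only the parameter change in `Calibrate` is reported, while the comment-only edit to `Stack` is ignored. This is the kind of unrecorded parameter change that a line-based diff in CVS or SVN would show without tying it to a specific class.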
