A data lineage model for distributed sub-image processing

An important challenge facing e-Science is the development of scalable systems and analysis techniques that allow client applications to locate data and services in increasingly large-scale distributed environments. e-Science systems should achieve three main goals: (i) process data efficiently and selectively; (ii) support network collaboration without clogging distribution networks; and (iii) make experiments transparent by keeping them repeatable and verifiable. Several systems have addressed limited combinations of these properties; this work addresses all three. We describe the architecture and implementation of such a framework in Astro-WISE, an astronomical approach to distributed data processing, discovery and retrieval of datasets that achieves scalability via dynamic linking (data lineage) maintained within the system. We show that lineage data collected during the processing and analysis of datasets can be reused to selectively reprocess datasets at sub-image level while leaving the remainder of each dataset untouched, a process that is rather difficult to automate without lineage.
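The idea of lineage-driven selective reprocessing can be illustrated with a minimal Python sketch. It is not the Astro-WISE implementation or API: the `LineageRecord` class, the `affected_outputs` function, and the tile names are all hypothetical, and the sketch assumes only that each derived sub-image records the identifiers of the inputs it was computed from. Given a set of changed inputs, it walks the recorded dependencies and returns the derived products that need to be recomputed; everything else can be left untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical lineage entry: one derived product and the inputs it came from."""
    output_id: str
    input_ids: tuple


def affected_outputs(records, changed_inputs):
    """Return ids of products that depend, directly or transitively, on changed inputs."""
    dirty = set(changed_inputs)   # everything known to be out of date
    affected = set()              # derived products that must be recomputed
    progressed = True
    while progressed:             # repeat until no new dependents are found
        progressed = False
        for rec in records:
            if rec.output_id not in affected and dirty.intersection(rec.input_ids):
                affected.add(rec.output_id)
                dirty.add(rec.output_id)
                progressed = True
    return affected


# Illustrative lineage for two sub-image tiles (names are made up).
records = [
    LineageRecord("calibrated_tile_A", ("raw_tile_A",)),
    LineageRecord("calibrated_tile_B", ("raw_tile_B",)),
    LineageRecord("coadd_AB", ("calibrated_tile_A", "calibrated_tile_B")),
    LineageRecord("catalog_B", ("calibrated_tile_B",)),
]

# Only raw_tile_A changed, so only its dependents are marked for reprocessing.
stale = affected_outputs(records, {"raw_tile_A"})
```

In this toy example the products derived solely from `raw_tile_B` are not marked, which is the behaviour the abstract describes: reprocessing is restricted to the sub-images whose lineage reaches the changed data.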
