The virtual data grid: a new model and architecture for data-intensive collaboration

It is increasingly common to encounter communities engaged in the collaborative analysis and transformation of large quantities of data over extended periods of time. I argue that these communities require a scalable system for managing, tracing, exploring, and communicating the derivation and analysis of diverse data objects. Such a system could bring significant productivity increases by facilitating the discovery, understanding, assessment, and sharing of both data and transformation resources for computation, storage, and collaboration. I define a model and architecture for a virtual data grid capable of addressing these requirements. I define a broadly applicable model of a "typed dataset" as the unit of derivation tracking, along with simple constructs for describing how datasets are derived from transformations and from other datasets. I also define mechanisms for integrating with, and adapting to, existing data management systems and transformation and analysis tools, as well as grid mechanisms for distributed resource management and computation planning. Finally, I report on successful application results obtained with a prototype implementation called Chimera, involving challenging analyses of high-energy physics and astronomy data.
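To make the idea of a typed dataset as the unit of derivation tracking more concrete, the following is a minimal sketch in Python. It is not Chimera's interface or data model: the class names (`Dataset`, `Transformation`, `Derivation`, `Catalog`), the type labels, and the example dataset names are all hypothetical and chosen only for illustration. The sketch assumes that each dataset carries a type label, that a transformation declares the types it consumes and produces, and that a derivation records which transformation was applied to which inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    dtype: str  # hypothetical type label, e.g. "EventFile"


@dataclass(frozen=True)
class Transformation:
    name: str
    input_types: tuple  # type labels expected, in order
    output_type: str    # type label of the produced dataset


@dataclass(frozen=True)
class Derivation:
    transformation: Transformation
    inputs: tuple       # tuple of Dataset
    output: Dataset


class Catalog:
    """Hypothetical in-memory record of how each dataset was derived."""

    def __init__(self):
        self._by_output = {}  # output dataset name -> Derivation

    def record(self, transformation, inputs, output_name):
        # Check the input types against the transformation's declaration.
        actual = tuple(d.dtype for d in inputs)
        if actual != transformation.input_types:
            raise TypeError(
                f"{transformation.name} expects {transformation.input_types}, "
                f"got {actual}"
            )
        output = Dataset(output_name, transformation.output_type)
        self._by_output[output.name] = Derivation(
            transformation, tuple(inputs), output
        )
        return output

    def lineage(self, dataset):
        """Return the names of all datasets that `dataset` was derived from."""
        seen = []
        stack = [dataset.name]
        while stack:
            derivation = self._by_output.get(stack.pop())
            if derivation is None:
                continue  # a primary dataset with no recorded derivation
            for parent in derivation.inputs:
                if parent.name not in seen:
                    seen.append(parent.name)
                    stack.append(parent.name)
        return seen


# Illustrative usage with made-up names.
raw = Dataset("raw_events", "EventFile")
reconstruct = Transformation("reconstruct", ("EventFile",), "TrackFile")
summarize = Transformation("summarize", ("TrackFile",), "Histogram")

catalog = Catalog()
tracks = catalog.record(reconstruct, [raw], "tracks")
hist = catalog.record(summarize, [tracks], "hist")
ancestors = catalog.lineage(hist)
```

In this sketch, `catalog.lineage(hist)` walks the recorded derivations back from `hist` through `tracks` to `raw_events`, and `record` rejects a derivation whose input types do not match the transformation's declaration. A distributed catalog, resource management, and computation planning are outside its scope.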
