Data Grids: a new computational infrastructure for data-intensive science

Twenty-first-century scientific and engineering enterprises are increasingly characterized by their geographic dispersion and their reliance on large data archives. These characteristics bring with them unique challenges. First, the increasing size and complexity of modern data collections require significant investments in information technologies to store, retrieve and analyse them. Second, the increased distribution of people and resources in these projects has made resource sharing and collaboration across significant geographic and organizational boundaries critical to their success. In this paper I explore how computing infrastructures based on Data Grids offer data-intensive enterprises a comprehensive, scalable framework for collaboration and resource sharing. A detailed example of a Data Grid framework is presented for a Large Hadron Collider experiment, where a hierarchical set of laboratory and university resources comprising petaflops of processing power and a multi-petabyte data archive must be efficiently used by a global collaboration. The experience gained with these new information systems, providing transparent managed access to massive distributed data collections, will be applicable to large-scale, data-intensive problems in a wide spectrum of scientific and engineering disciplines, and eventually in industry and commerce. Such systems will be needed in the coming decades as a central element of our information-based society.
