Managing very large distributed data sets on a data grid

In this work we address the management of very large data sets that need to be stored and processed across many computing sites. The motivation for our work is the ATLAS experiment at the Large Hadron Collider (LHC), where the authors have been involved in the development of the data management middleware. This middleware, called DQ2, has been used for the last several years by the ATLAS experiment to ship petabytes of data to research centres and universities worldwide. We describe our experience in developing and deploying DQ2 on the Worldwide LHC Computing Grid, a production Grid infrastructure formed of hundreds of computing sites. From this operational experience, we have identified a significant degree of uncertainty underlying the behaviour of large Grid infrastructures. We analyse this uncertainty in detail and, on that basis, present novel modelling and simulation techniques for Data Grids. In addition, we discuss what we perceive as practical limits to the development of data distribution algorithms for Data Grids, given the uncertainty of the underlying infrastructure, and propose future research directions. Copyright © 2009 John Wiley & Sons, Ltd.
