File and Object Replication in Data Grids

Data replication is a key issue in a data grid and can be managed in different ways and at different levels of granularity, for example at the file level or the object level. In the high-energy physics community, data grids are being developed to support the distributed analysis of experimental data. We have produced a prototype data replication tool, the Grid Data Management Pilot (GDMP), which is in production use in one physics experiment; it relies on middleware from the Globus toolkit for authentication, data movement, and other purposes. We present a new, enhanced GDMP architecture and prototype implementation that uses Globus data-grid tools for efficient file replication. We also explain how this architecture can address object replication issues in an object-oriented database management system. File transfer over wide-area networks requires specific performance tuning to achieve optimal data transfer rates. We present performance results obtained with GridFTP, an enhanced version of FTP, and discuss tuning parameters.
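
As a rough illustration of the kind of tuning parameter involved in wide-area transfers, the sketch below estimates a TCP buffer size from the bandwidth-delay product, a common rule of thumb in TCP tuning. It is not taken from the GDMP or GridFTP implementation: the function name `tcp_buffer_bytes`, the even split across parallel streams, and the example link figures are all assumptions made for illustration.

```python
def tcp_buffer_bytes(bandwidth_mbps: float, rtt_ms: float, streams: int = 1) -> int:
    """Estimate a per-stream TCP buffer size (bytes) from the bandwidth-delay product.

    Assumption: the link capacity is shared evenly by `streams` parallel
    TCP connections, so each stream needs 1/streams of the total.
    """
    if bandwidth_mbps <= 0 or rtt_ms <= 0:
        raise ValueError("bandwidth and round-trip time must be positive")
    if streams < 1:
        raise ValueError("streams must be at least 1")
    # bits in flight = (Mbit/s * 1e6) * (ms / 1e3) = Mbit/s * ms * 1000
    # bytes in flight = bits / 8 = Mbit/s * ms * 125
    total_bytes = bandwidth_mbps * rtt_ms * 125
    return round(total_bytes / streams)


if __name__ == "__main__":
    # Hypothetical link: 100 Mbit/s with a 100 ms round-trip time.
    print(tcp_buffer_bytes(100, 100))             # single stream
    print(tcp_buffer_bytes(100, 100, streams=4))  # four parallel streams
```

The point of the estimate is that a buffer much smaller than the bandwidth-delay product leaves a long, fast link underused, which is why buffer size and the number of parallel streams are typical parameters to adjust.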
