High-Performance Remote Access to Climate Simulation Data: A Challenge Problem for Data Grid Technologies

In numerous scientific disciplines, terabyte- and soon petabyte-scale data collections are emerging as critical community resources. A new class of Data Grid infrastructure is required to support the management, transport, distributed access, and analysis of these datasets by potentially thousands of users. Researchers who face this challenge include the climate modeling community, which performs long-duration computations that frequently output very large files requiring further analysis. We describe the Earth System Grid prototype, which brings together advanced analysis, replica management, data transfer, request management, and other technologies to support high-performance, interactive analysis of replicated data. We present performance results that demonstrate our ability to manage the location and movement of large datasets from the user’s desktop. We also report on experiments conducted over SCinet at SC’2000, where we achieved a peak rate of 1.55 Gb/s and a sustained rate of 512.9 Mb/s for data transfers between Texas and California.
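
To put the reported transfer rates in context, the following minimal sketch estimates how long a single dataset would take to move at each rate. The 1 TB dataset size is a hypothetical value chosen for illustration, not a figure from the experiments, and decimal units are assumed (1 Mb = 10^6 bits, 1 TB = 10^12 bytes). The function and constant names are likewise illustrative.

```python
def transfer_seconds(size_bytes: float, rate_bits_per_s: float) -> float:
    """Idealized transfer time: payload size in bits divided by the link rate."""
    return size_bytes * 8 / rate_bits_per_s


# Rates quoted in the abstract, converted to bits per second (decimal units assumed).
SUSTAINED_BPS = 512.9e6  # 512.9 Mb/s sustained
PEAK_BPS = 1.55e9        # 1.55 Gb/s peak

# Hypothetical dataset size, for illustration only.
ONE_TB = 1e12  # bytes

sustained_hours = transfer_seconds(ONE_TB, SUSTAINED_BPS) / 3600
peak_hours = transfer_seconds(ONE_TB, PEAK_BPS) / 3600

print(f"1 TB at the sustained rate: {sustained_hours:.2f} h")
print(f"1 TB at the peak rate:      {peak_hours:.2f} h")
```

Under these assumptions, a terabyte takes roughly 4.3 hours at the sustained rate and about 1.4 hours at the peak rate. The estimate ignores protocol overhead, disk throughput, and contention.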


[23]  Carl Kesselman, et al.  High-Performance Remote Access to Climate Simulation Data: A Challenge Problem for Data Grid Technologies, 2001, ACM/IEEE SC 2001 Conference (SC'01).
