Supporting Efficient Execution in Heterogeneous Distributed Computing Environments with Cactus and Globus

Improvements in the performance of processors and networks make it both feasible and interesting to treat collections of workstations, servers, clusters, and supercomputers as integrated computational resources, or Grids. However, the highly heterogeneous and dynamic nature of such Grids can make application development difficult. Here we describe an architecture and prototype implementation for a Grid-enabled computational framework based on Cactus, the MPICH-G2 Grid-enabled message-passing library, and a variety of specialized features to support efficient execution in Grid environments. We have used this framework to perform record-setting computations in numerical relativity, running across four supercomputers and achieving scaling of 88% (1140 CPUs) and 63% (1500 CPUs). The problem size we were able to compute was about five times larger than that of any previous run. Further, we introduce and demonstrate adaptive methods that automatically adjust computational parameters at run time, dramatically increasing the efficiency of a distributed Grid simulation without modification of the application and without prior knowledge of the underlying network connecting the distributed computers.
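The abstract's claim that no application modification is needed rests on MPICH-G2 implementing the standard MPI interface. As a minimal sketch (not taken from the paper, with illustrative names such as NPOINTS and GHOST), the following C fragment shows the kind of ghost-zone exchange a Cactus-style solver performs using only standard MPI calls; the same source could, in a typical setup, be linked against MPICH-G2 and launched across multiple machines via Globus without any source changes.

```c
/*
 * Minimal sketch: a 1-D ghost-zone exchange written against standard MPI only.
 * NPOINTS and GHOST are illustrative placeholders, not parameters from the paper.
 */
#include <mpi.h>
#include <stdio.h>

#define NPOINTS 1024   /* local grid points per process (illustrative) */
#define GHOST   1      /* ghost-zone width (illustrative)              */

int main(int argc, char **argv)
{
    int rank, size;
    double u[NPOINTS + 2 * GHOST];   /* local strip plus ghost zones */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int i = 0; i < NPOINTS + 2 * GHOST; i++)
        u[i] = (double)rank;         /* trivial initial data */

    /* Exchange ghost zones with neighbours; MPI_PROC_NULL turns the
       boundary ranks' transfers into no-ops, so no special-casing is needed. */
    MPI_Sendrecv(&u[GHOST],           GHOST, MPI_DOUBLE, left,  0,
                 &u[NPOINTS + GHOST], GHOST, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[NPOINTS],         GHOST, MPI_DOUBLE, right, 1,
                 &u[0],               GHOST, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("ghost-zone exchange completed on %d processes\n", size);

    MPI_Finalize();
    return 0;
}
```

Because nothing in such a code is specific to a single machine, the Grid-related decisions (which sites participate, how wide-area messages are routed, how communication parameters are tuned) can be handled below the MPI layer, which is where the adaptive methods described in the paper operate.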
