Bandwidth-efficient collective communication for clustered wide area systems

Metacomputing infrastructures couple multiple clusters (or MPPs) via wide-area networks. A major problem in programming parallel applications for such platforms is their hierarchical network structure: the latency and bandwidth of WANs are often orders of magnitude worse than those of local networks. Our goal is to optimize MPI's collective operations for such platforms. In this paper we focus on making the best use of the scarce wide-area bandwidth. We use two techniques: selecting suitable communication graph shapes, and splitting messages into multiple segments that are sent in parallel over different WAN links. To determine the best graph shape and segment size, we introduce a performance model called parameterized LogP (P-LogP), a hierarchical extension of the LogP model that covers messages of arbitrary length. With P-LogP, the optimal segment size and the best broadcast tree shape can be determined at runtime. (For conciseness, we restrict our discussion to the broadcast operation.) An experimental performance evaluation shows that the new broadcast performs significantly better for large messages, and that the completion times predicted by the model closely match the measured ones.
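To illustrate why a cost model can pick a segment size at runtime, the following is a minimal sketch and not the P-LogP model itself. It assumes a hypothetical gap function `gap(m)` (a fixed per-message cost plus a bandwidth term), a single latency value, and a simple chain of WAN hops; all names and parameter values are illustrative assumptions. It also treats every segment as full-sized, which slightly overestimates the cost of the last one.

```python
import math

def gap(size_bytes, per_msg_overhead=1e-3, bandwidth=1e6):
    """Hypothetical gap g(m): seconds between consecutive sends of m-byte messages."""
    return per_msg_overhead + size_bytes / bandwidth

def chain_broadcast_time(msg_bytes, seg_bytes, hops, latency=0.01):
    """Estimated time to forward a segmented message along a chain of `hops` links."""
    k = math.ceil(msg_bytes / seg_bytes)   # number of segments
    g = gap(seg_bytes)                     # all segments costed as full-sized
    # The first segment crosses every hop; the other k-1 follow one gap apart.
    return hops * (latency + g) + (k - 1) * g

def best_segment_size(msg_bytes, hops, candidates):
    """Pick the candidate segment size with the lowest estimated completion time."""
    return min(candidates, key=lambda s: chain_broadcast_time(msg_bytes, s, hops))

if __name__ == "__main__":
    msg = 1_000_000
    sizes = [1024 * 2**i for i in range(10)] + [msg]   # 1 KiB .. 512 KiB, or unsegmented
    for hops in (1, 3):
        s = best_segment_size(msg, hops, sizes)
        print(hops, s, round(chain_broadcast_time(msg, s, hops), 4))
```

Under these assumptions, segmentation only pays off when a segment can be forwarded over the next link while later segments are still arriving. With a single hop, the per-message overhead makes the unsegmented send the cheapest choice.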
