Extending collective operations with application semantics for improving multi-cluster performance

We identify two ways of increasing the performance of allreduce-style collective operations in a multi-cluster with large WAN latencies: (i) hiding the WAN latency in system noise, and (ii) a conditional allreduce, where knowledge about the application is used to reduce the number of WAN messages. In our multi-cluster, the system noise was not large enough to hide the WAN latency. However, the latency could be hidden using the conditional allreduce, since on many iterations only cluster-local values were needed, and many of the values needed from other clusters were prefetched. A speedup of 2.4 was achieved on a microbenchmark. Prefetching introduced a small overhead on the cluster with the slowest hosts.
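
The conditional allreduce described above can be illustrated with a minimal MPI sketch in C. This is not the paper's implementation: the rank-to-cluster mapping in cluster_color(), the four-hosts-per-cluster split, and the assumption that remote values are only needed every REMOTE_PERIOD iterations are hypothetical stand-ins for the application knowledge the paper exploits, and the prefetching of remote values is omitted.

/* Minimal sketch of a conditional allreduce (not the paper's code).
 * Assumptions: cluster_color() is a hypothetical mapping from rank to
 * cluster, and the application knows that remote values are only needed
 * every REMOTE_PERIOD iterations, so every process evaluates the same
 * condition and no collective call is mismatched. */
#include <mpi.h>
#include <stdio.h>

#define REMOTE_PERIOD 10   /* assumed: WAN values needed every 10th iteration */

/* Hypothetical: map a global rank to its cluster (e.g. from a host file). */
static int cluster_color(int world_rank) {
    return world_rank / 4;             /* assumption: 4 hosts per cluster */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One communicator per cluster: an allreduce on it stays on the LAN. */
    MPI_Comm local;
    MPI_Comm_split(MPI_COMM_WORLD, cluster_color(rank), rank, &local);

    double value = (double)rank, result = 0.0;

    for (int iter = 0; iter < 100; iter++) {
        if (iter % REMOTE_PERIOD == 0) {
            /* Remote values are needed: pay the WAN latency. */
            MPI_Allreduce(&value, &result, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
        } else {
            /* Only cluster-local values are needed: cheap LAN allreduce. */
            MPI_Allreduce(&value, &result, 1, MPI_DOUBLE, MPI_SUM, local);
        }
        value = 0.5 * result + rank;   /* stand-in for real computation */
    }

    if (rank == 0)
        printf("final result on rank 0: %f\n", result);

    MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
}

The key design point is that the condition selecting the WAN-wide allreduce must be evaluated identically on every process; here that is guaranteed by basing it on the iteration counter alone, whereas an application-level predicate would need the same global consistency to avoid mismatched collectives.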
