Evaluation of publicly available Barrier-Algorithms and Improvement of the Barrier-Operation for large-scale Cluster-Systems with special Attention on InfiniBand Networks

The MPI barrier collective operation, part of the MPI-1.1 standard, is critical for every parallel application that uses it. Its latency adds directly to the application run time and cannot be overlapped with computation, so unsatisfactory barrier latency can degrade overall MPI performance. The main goals of this work are to lower the barrier latency for InfiniBand networks by analyzing well-known barrier algorithms with regard to their suitability for InfiniBand networks, to enhance the barrier operation by using standard InfiniBand operations as much as possible, and to design a constant-time barrier for InfiniBand with special hardware support. This partition into three main steps is retained throughout the thesis. The first part evaluates publicly known models and proposes a new, more accurate model (LoP) for InfiniBand. All barrier algorithms are evaluated within both the well-known LogP model and this new model. Two new algorithms that promise better performance have been developed. The hardware section proposes a constant-time barrier integrated into InfiniBand as well as a cheap, separate barrier network. All results have been implemented inside the Open MPI framework, leading to three new Open MPI collective modules. The first implements different barrier algorithms, which are dynamically benchmarked and selected during the startup phase to maximize performance. The second offers a special barrier implementation for InfiniBand with RDMA and performs up to 40% better than the best solution published so far. The third offers a constant-time barrier in a separate network built from commodity components, with a latency of only 2.5 μs. Each component has its own strengths and can be used to improve barrier performance significantly.
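To make the idea of rating a barrier algorithm within LogP more concrete, the following is a minimal sketch for one well-known algorithm, the dissemination barrier. The abstract does not say which algorithms were evaluated, so the choice is illustrative only. The function names, the per-round cost `max(L + 2*o, g)`, and the parameter values are assumptions made for this example; they are not the thesis's LoP model or its measured results.

```python
def num_rounds(size):
    """Rounds of a dissemination barrier: ceil(log2(size)), computed exactly."""
    return (size - 1).bit_length()


def dissemination_schedule(rank, size):
    """Per round k, (peer to notify, peer to wait for) at distance 2**k."""
    return [((rank + (1 << k)) % size, (rank - (1 << k)) % size)
            for k in range(num_rounds(size))]


def all_ranks_synchronized(size):
    """Simulate the rounds and check that every rank has (transitively)
    heard from every other rank once the last round completes."""
    known = [{r} for r in range(size)]
    for k in range(num_rounds(size)):
        known = [known[r] | known[(r - (1 << k)) % size] for r in range(size)]
    return all(k == set(range(size)) for k in known)


def logp_barrier_estimate(size, L, o, g):
    """Rough LogP estimate (an assumed simplification): each round costs
    one send overhead, the wire latency, and one receive overhead,
    but never less than the gap g between consecutive messages."""
    return num_rounds(size) * max(L + 2 * o, g)


if __name__ == "__main__":
    # Hypothetical parameter values in microseconds, for illustration only.
    for p in (2, 8, 64, 1024):
        print(p, num_rounds(p), logp_barrier_estimate(p, L=5.0, o=1.0, g=2.0))
```

In this simplified view the estimated latency grows with the logarithm of the process count, which is the growth that a constant-time hardware barrier is meant to avoid.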
