Exploiting communication concurrency on high performance computing systems

Even when communication concurrency is logically available, applications may not exploit enough of it at any instant to maximize hardware utilization on HPC systems. The problem is exacerbated in hybrid programming models such as SPMD+OpenMP. We present the design of a "multi-threaded" runtime that transparently increases instantaneous network concurrency and delivers near-saturation bandwidth, independent of the application's configuration and dynamic behavior. The runtime forwards communication requests from application-level tasks to multiple communication servers. Our techniques alleviate the need for spatial and temporal application-level message-concurrency optimizations. Experimental results show message throughput and bandwidth improved by as much as 150% for 4 KB messages on InfiniBand and by as much as 120% for 4 KB messages on Cray Aries. For more complex operations such as all-to-all collectives, we observe as much as 30% speedup. This translates into a 23% speedup on 12,288 cores for the NAS FT benchmark implemented using FFTW. We also observe as much as 76% speedup on 1,500 cores for an already optimized UPC+OpenMP geometric multigrid application using hybrid parallelism.
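
The forwarding mechanism the abstract describes can be illustrated with a small sketch. Below is a minimal C++ example of the pattern, assuming a mutex-protected request queue and a fixed pool of communication server threads; all names (CommRequest, RequestQueue, issue_on_network) are hypothetical, and a real runtime would inject the requests into the network (e.g., via GASNet or verbs) rather than print them.

```cpp
// Hypothetical sketch: application-level tasks enqueue communication
// requests, and a pool of dedicated communication server threads drains
// the queue and issues the actual network operations.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct CommRequest {   // placeholder for a put/get/send descriptor
    int dest;          // destination rank
    size_t bytes;      // message size
};

class RequestQueue {
    std::queue<CommRequest> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
public:
    void push(CommRequest r) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(r); }
        cv_.notify_one();
    }
    // Blocks until a request is available; returns false once the queue
    // has been shut down and fully drained.
    bool pop(CommRequest& r) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return done_ || !q_.empty(); });
        if (q_.empty()) return false;
        r = q_.front(); q_.pop();
        return true;
    }
    void shutdown() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
    }
};

// Stand-in for the real network injection call.
static void issue_on_network(int server_id, const CommRequest& r) {
    std::printf("server %d: sending %zu bytes to rank %d\n",
                server_id, r.bytes, r.dest);
}

int main() {
    RequestQueue queue;
    const int num_servers = 4;   // pool size, not app thread count,
                                 // sets the instantaneous concurrency
    std::vector<std::thread> servers;
    for (int s = 0; s < num_servers; ++s)
        servers.emplace_back([&queue, s] {
            CommRequest r;
            while (queue.pop(r)) issue_on_network(s, r);
        });

    // Application-level tasks (here, OpenMP-like worker threads) forward
    // their requests instead of touching the network directly.
    std::vector<std::thread> workers;
    for (int w = 0; w < 8; ++w)
        workers.emplace_back([&queue, w] {
            for (int i = 0; i < 4; ++i)
                queue.push({/*dest=*/w, /*bytes=*/4096});
        });
    for (auto& t : workers) t.join();

    queue.shutdown();
    for (auto& t : servers) t.join();
}
```

The point of decoupling injection from the application threads is that the number of in-flight messages is set by the server pool size rather than by how many application threads happen to be communicating at any instant, which is what lets the runtime sustain network concurrency independent of application behavior.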
