Evaluating Collectives in Networks of Multicore / Two-level Reduction

As clusters of multicore nodes become the standard platform for HPC, programmers are adopting approaches that combine multicore programming (e.g., OpenMP) for on-node parallelism with MPI for inter-node parallelism, the so-called "MPI+X" model. In important use cases, such as reductions, this hybrid approach can necessitate a scalability-limiting sequence of independent parallel operations, one for each paradigm. For example, MPI+OpenMP typically performs a global parallel reduction by first performing a local OpenMP reduction on each node, followed by an MPI reduction across the nodes. If the local reductions are not well balanced, as can happen in irregular or dynamically adaptive applications, the scalability of the overall reduction operation becomes limited. In this paper we study the empirical and theoretical impact of imbalanced reductions on two different execution models, MPI+X and AMT (Asynchronous Many-Task), with MPI+OpenMP and HPX-5 as concrete instances of these respective models. We explore several approaches to maximizing asynchrony with MPI+OpenMP, including OpenMP tasking, as well as an MPI-only configuration that detaches X altogether. We study the effects of imbalanced reductions both for microbenchmarks and for the LULESH mini-app. Despite maximizing MPI+OpenMP asynchrony, we find that as scale and noise increase, the scalability of the MPI+X model is significantly reduced compared to the AMT model.
