Early Experiences Co-Scheduling Work and Communication Tasks for Hybrid MPI+X Applications

Advances in node-level architecture and interconnect technology needed to reach extreme scale necessitate a reevaluation of long-standing models of computation, in particular bulk synchronous processing. The end of Dennard scaling and the resulting increase in CPU core counts with each successive generation of general-purpose processors have made the ability to exploit parallelism in communication increasingly critical to future extreme-scale application performance. Yet the use of massive multithreading in combination with MPI remains an open research area, and many proposed approaches require code changes that are infeasible for important large legacy applications already written in MPI. This paper presents the design and initial evaluation of an extension to a massively multithreaded runtime system that supports dynamic parallelism and interfaces with MPI to manage fine-grain parallel communication and communication-computation overlap. Our initial evaluation uses the ubiquitous three-dimensional stencil computation with halo exchange as the driving example, a pattern with demonstrated ties to real code bases. The preliminary results suggest that, even for this well-studied and balanced workload and message-exchange pattern, co-scheduling work and communication tasks is effective at significant levels of decomposition on up to 131,072 cores. Furthermore, we demonstrate useful communication-computation overlap when handling blocking send and receive calls, and we show evidence that the approach reduces the burstiness of network traffic, with a corresponding decrease in the rate of stalls (congestion) observed on the host link and in the network.
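To illustrate the idea of co-scheduling communication tasks with computation, the following is a minimal sketch, not the authors' implementation: it uses a 1-D halo exchange and POSIX threads as stand-ins for the lightweight runtime tasks described in the paper, and names such as N, exchange(), and the placeholder stencil loop are illustrative assumptions. Each boundary exchange runs as its own task issuing blocking MPI calls, while the interior computation proceeds concurrently; this requires an MPI library providing MPI_THREAD_MULTIPLE.

```c
/* Hypothetical sketch: overlap blocking halo exchange with interior work
 * by running each exchange as a separate thread (stand-in for a task). */
#include <mpi.h>
#include <pthread.h>
#include <stdlib.h>

#define N 1024  /* local 1-D domain size (placeholder) */

typedef struct { double *halo, *edge; int peer, stag, rtag; } exch_t;

/* Communication task: blocking send of our edge value and blocking receive
 * of the neighbor's edge into our halo cell.  Blocking calls are tolerable
 * here because the task runs on its own thread while computation proceeds. */
static void *exchange(void *p) {
    exch_t *e = (exch_t *)p;
    MPI_Sendrecv(e->edge, 1, MPI_DOUBLE, e->peer, e->stag,
                 e->halo, 1, MPI_DOUBLE, e->peer, e->rtag,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return NULL;
}

int main(int argc, char **argv) {
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *u = calloc(N + 2, sizeof *u);   /* u[0] and u[N+1] are halo cells */

    int left  = (rank + size - 1) % size;
    int right = (rank + 1) % size;
    /* Tag convention: rightward-traveling messages use tag 0, leftward tag 1. */
    exch_t lx = { &u[0],     &u[1], left,  1, 0 };
    exch_t rx = { &u[N + 1], &u[N], right, 0, 1 };

    pthread_t tl, tr;
    pthread_create(&tl, NULL, exchange, &lx);   /* co-scheduled comm tasks */
    pthread_create(&tr, NULL, exchange, &rx);

    /* Compute the interior (cells 2..N-1) while the halo exchanges are in
     * flight; a real kernel would be the 3-D stencil from the paper. */
    for (int i = 2; i < N; ++i)
        u[i] = 0.5 * u[i];                      /* placeholder interior work */

    pthread_join(tl, NULL);                     /* halos ready: finish edges */
    pthread_join(tr, NULL);
    /* ... update cells 1 and N using the received halo values ... */

    free(u);
    MPI_Finalize();
    return 0;
}
```

In a task-based runtime such as Qthreads, the two exchange threads would instead be lightweight tasks co-scheduled with many fine-grain compute tasks, which is the scenario evaluated in the paper.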
