Lock Contention Management in Multithreaded MPI

In this article, we investigate contention management in lock-based thread-safe MPI libraries. Specifically, we make two assumptions: (1) locks are the only form of synchronization protecting communication paths; and (2) contention occurs, and thus serialization is unavoidable. Our work classifies lock acquisitions by the work performed inside the critical section: productive versus unproductive. Waiting for message reception without doing anything else inside a critical section is an example of an unproductive lock acquisition. We show that the high-throughput nature of modern scalable locking protocols translates into better communication progress for throughput-intensive MPI communication but negatively impacts latency-sensitive communication because of overzealous unproductive lock acquisitions. To reduce unproductive lock acquisitions, we devised a method that promotes threads with productive work using a generic two-level priority locking protocol. Our results show that using a high-throughput protocol for productive work and a fair protocol for less productive code paths yields the best tradeoff for fine-grained communication, whereas a fair protocol is sufficient for more coarse-grained communication. Although these efforts have been rewarding, scalability degradation remains significant. We discuss techniques that diverge from the pure locking model and offer the potential to further improve scalability.
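
The idea of promoting productive threads through a two-level priority lock can be illustrated with a minimal sketch. This is not the protocol evaluated in the article: it assumes a plain condition variable in place of the scalable and fair locks mentioned above, and the names `TwoLevelPriorityLock`, `productive`, and `demo` are hypothetical. It only shows the ordering property: a thread with productive work is served before a waiting thread that would merely poll.

```python
import threading
import time


class TwoLevelPriorityLock:
    """Illustrative mutex with two priority classes (hypothetical, not an MPI API).

    Threads that declare productive work form the high-priority class and are
    always admitted before waiting low-priority (unproductive) threads.
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._held = False
        self._high_waiting = 0  # high-priority threads currently waiting

    def acquire(self, productive):
        with self._cond:
            if productive:
                self._high_waiting += 1
                while self._held:
                    self._cond.wait()
                self._high_waiting -= 1
            else:
                # Low priority yields to any waiting high-priority thread.
                while self._held or self._high_waiting > 0:
                    self._cond.wait()
            self._held = True

    def release(self):
        with self._cond:
            self._held = False
            self._cond.notify_all()


def demo():
    """Return the order in which two contending threads enter the lock."""
    lock = TwoLevelPriorityLock()
    order = []

    def worker(name, productive):
        lock.acquire(productive)
        order.append(name)  # stands in for work inside the critical section
        lock.release()

    lock.acquire(productive=True)  # main thread holds the lock
    low = threading.Thread(target=worker, args=("unproductive", False))
    high = threading.Thread(target=worker, args=("productive", True))
    low.start()
    high.start()
    # Wait until the productive thread is registered as a waiter
    # (peeking at a private field is acceptable only in this demo).
    while lock._high_waiting == 0:
        time.sleep(0.001)
    lock.release()
    low.join()
    high.join()
    return order


if __name__ == "__main__":
    print(demo())
```

In this sketch the unproductive thread starts first but still enters the critical section last, because the lock refuses low-priority acquisitions while any high-priority waiter exists. A real implementation would presumably pair a high-throughput lock for the productive class with a fair lock for the other, as the abstract describes.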
