Paper Information - Lock Contention Management in Multithreaded MPI

In this article, we investigate contention management in lock-based thread-safe MPI libraries. Specifically, we make two assumptions: (1) locks are the only form of synchronization when protecting communication paths; and (2) contention occurs, and thus serialization is unavoidable. Our work distinguishes lock acquisitions according to the work performed inside the critical section: productive versus unproductive. Waiting for message reception without doing anything else inside a critical section is an example of an unproductive lock acquisition. We show that the high-throughput nature of modern scalable locking protocols translates into better communication progress for throughput-intensive MPI communication but negatively impacts latency-sensitive communication because of overzealous unproductive lock acquisitions. To reduce unproductive lock acquisitions, we devised a method that promotes threads with productive work using a generic two-level priority locking protocol. Our results show that using a high-throughput protocol for productive work and a fair protocol for less productive code paths ensures the best tradeoff for fine-grained communication, whereas a fair protocol is sufficient for more coarse-grained communication. Although these efforts have been rewarding, scalability degradation remains significant. We discuss techniques that diverge from the pure locking model and offer the potential to further improve scalability.
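
The idea of promoting threads with productive work can be illustrated with a minimal sketch of a two-level priority lock. This is not the authors' implementation: the class `TwoLevelPriorityLock`, its `productive` flag, and the `demo` helper are hypothetical names chosen for illustration, and the sketch uses a single condition variable rather than a separate high-throughput protocol and fair protocol per level. It only shows the priority rule itself: a thread without productive work (for example, one polling for message reception) yields to any waiting thread that has productive work.

```python
import threading
import time


class TwoLevelPriorityLock:
    """Illustrative two-level priority lock (hypothetical, not from the paper)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._held = False
        # Number of waiters per level: True = productive, False = unproductive.
        self._waiting = {True: 0, False: 0}

    def acquire(self, productive):
        with self._cond:
            self._waiting[productive] += 1
            if productive:
                # High priority: wait only for the current holder.
                while self._held:
                    self._cond.wait()
            else:
                # Low priority: also yield to every waiting productive thread.
                while self._held or self._waiting[True] > 0:
                    self._cond.wait()
            self._waiting[productive] -= 1
            self._held = True

    def release(self):
        with self._cond:
            self._held = False
            self._cond.notify_all()

    def waiting(self, productive):
        """Return how many threads currently wait at the given level."""
        with self._cond:
            return self._waiting[productive]


def demo():
    """Queue a polling thread before a sending thread; return acquisition order."""
    lock = TwoLevelPriorityLock()
    order = []

    def worker(name, productive):
        lock.acquire(productive)
        order.append(name)  # recorded while holding the lock
        lock.release()

    lock.acquire(productive=True)  # hold the lock so both workers must wait
    low = threading.Thread(target=worker, args=("poll", False))
    high = threading.Thread(target=worker, args=("send", True))
    low.start()
    while lock.waiting(False) == 0:  # ensure the low-priority thread waits first
        time.sleep(0.001)
    high.start()
    while lock.waiting(True) == 0:
        time.sleep(0.001)
    lock.release()
    low.join()
    high.join()
    return order
```

In `demo`, the polling thread arrives first but the sending thread still acquires the lock first, which is the behavior a strictly first-come, first-served lock would not provide.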
