Tail queues: A multi‐threaded matching architecture

As we approach exascale, computational parallelism must increase drastically to meet throughput targets. Many-core architectures exacerbate this problem by trading clock speed, core complexity, and per-core throughput for greater parallelism. This presents two major challenges for communication libraries such as MPI: the library must exploit the performance advantages of thread-level parallelism while avoiding the scalability problems associated with increasing the number of processes to exascale levels. Hybrid programming models, such as MPI+X, have been proposed to address these challenges. MPI_THREAD_MULTIPLE is MPI's fully thread-safe mode; although there has been work to optimize it, it remains slow in most implementations. Current applications avoid multithreaded MPI because of these performance concerns, but future applications are expected to rely on it. One of the major synchronized data structures an MPI implementation must provide is the message-matching engine. In this paper, we present a parallel matching algorithm that can improve MPI matching for multithreaded applications. We then perform a feasibility study to demonstrate the performance benefit of the technique.
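To make the matching bottleneck concrete, the sketch below shows a conventional lock-protected MPI matching engine in C. It is a minimal illustration under assumed names (match_engine, match_incoming, a single mutex around both queues) and is not the paper's tail-queue design: the posted-receive and unexpected-message queues sit behind one lock, so every MPI_THREAD_MULTIPLE caller serializes on each match attempt.

    /*
     * Illustrative sketch only (assumed structure, not the paper's tail-queue
     * design): a conventional MPI matching engine keeps two ordered lists --
     * posted receives and unexpected messages -- that must be searched and
     * updated atomically.  Under MPI_THREAD_MULTIPLE a single lock around
     * both lists serializes every match attempt by every thread, which is the
     * contention a parallel matching algorithm aims to remove.
     */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define ANY -1                      /* stands in for MPI_ANY_SOURCE / MPI_ANY_TAG */

    typedef struct match_entry {
        int src, tag, comm_id;          /* matching criteria */
        void *payload;                  /* request or message data */
        struct match_entry *next;
    } match_entry;

    typedef struct {
        match_entry *posted_head, *posted_tail;   /* posted-receive queue */
        match_entry *unexp_head, *unexp_tail;     /* unexpected-message queue */
        pthread_mutex_t lock;                     /* single global matching lock */
    } match_engine;

    static bool matches(const match_entry *e, int src, int tag, int comm_id)
    {
        return e->comm_id == comm_id &&
               (e->src == ANY || src == ANY || e->src == src) &&
               (e->tag == ANY || tag == ANY || e->tag == tag);
    }

    /* Remove and return the first matching entry in a queue, preserving MPI's
     * in-order matching semantics; returns NULL if nothing matches. */
    static match_entry *dequeue_match(match_entry **head, match_entry **tail,
                                      int src, int tag, int comm_id)
    {
        match_entry *prev = NULL;
        for (match_entry *e = *head; e != NULL; prev = e, e = e->next) {
            if (matches(e, src, tag, comm_id)) {
                if (prev) prev->next = e->next; else *head = e->next;
                if (e == *tail) *tail = prev;
                e->next = NULL;
                return e;
            }
        }
        return NULL;
    }

    static void enqueue(match_entry **head, match_entry **tail, match_entry *e)
    {
        e->next = NULL;
        if (*tail) (*tail)->next = e; else *head = e;
        *tail = e;
    }

    /* Progress path: an arriving message either completes a posted receive or
     * is appended to the unexpected queue.  Every thread funnels through the
     * same lock, so matching throughput does not scale with thread count. */
    match_entry *match_incoming(match_engine *eng, match_entry *msg)
    {
        pthread_mutex_lock(&eng->lock);
        match_entry *rcv = dequeue_match(&eng->posted_head, &eng->posted_tail,
                                         msg->src, msg->tag, msg->comm_id);
        if (!rcv)
            enqueue(&eng->unexp_head, &eng->unexp_tail, msg);
        pthread_mutex_unlock(&eng->lock);
        return rcv;                     /* non-NULL: the matched posted receive */
    }

A parallel matching scheme replaces the single lock so that concurrent receives and arrivals can proceed without funneling through one critical section, while still honoring MPI's ordered-matching semantics.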
