Toward Operating System Support for Scalable Multithreaded Message Passing

Modern CPU architectures provide a large number of processing cores, and application programmers are increasingly turning to hybrid programming models in which multiple threads of a single process interact with the MPI library simultaneously. Moreover, recent high-speed interconnection networks are designed with capabilities that explicitly target communication from multiple processor cores. As a result, scaling the MPI library so that multithreaded applications can efficiently drive independent network communication has become a major concern. In this work, we propose a novel operating-system-level concept called the thread private shared library (TPSL), which enables the threads of a multithreaded application to see specific shared libraries in a private fashion. Unlike address spaces in traditional operating systems, where all threads of a process refer to the exact same set of virtual-to-physical mappings, our technique relies on separate per-thread page tables. Mapping the MPI library in a thread-private fashion yields per-thread MPI ranks, eliminating resource contention in the MPI library without the need to redesign it. To demonstrate the benefits of our mechanism, we provide a preliminary evaluation of various aspects of multithreaded MPI processing through micro-benchmarks on two widely used MPI implementations, MPICH and MVAPICH, with only minor modifications to the libraries.
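The abstract describes a programming model rather than a new API, so the following C sketch is only illustrative. It assumes a TPSL-enabled runtime in which each thread's private mapping of the MPI library carries its own rank; the pthread launch code and the placement of MPI_Init_thread/MPI_Finalize in the main thread are assumptions made here so that the sketch also compiles and runs against a stock MPI installation (where all threads would instead report the same, process-wide rank).

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Thread body: with a thread-private mapping of the MPI library (TPSL),
 * each thread holds its own library state, so MPI_Comm_rank() reports a
 * per-thread rank rather than one rank shared by the whole process. */
static void *thread_main(void *arg)
{
    (void)arg;
    int rank, size;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("thread observes MPI rank %d of %d\n", rank, size);

    /* Pair ranks (0<->1, 2<->3, ...) purely for illustration and exchange
     * one integer, so each thread drives its own point-to-point traffic. */
    int peer = (rank % 2 == 0) ? rank + 1 : rank - 1;
    int sendval = rank, recvval = -1;
    if (peer >= 0 && peer < size)
        MPI_Sendrecv(&sendval, 1, MPI_INT, peer, 0,
                     &recvval, 1, MPI_INT, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t threads[NUM_THREADS];
    int provided;

    /* Where initialization/finalization happens under a TPSL runtime is an
     * open design choice; doing it once in the main thread is an assumption
     * that keeps the sketch valid for an unmodified MPI library as well. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, thread_main, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    MPI_Finalize();
    return 0;
}

Launched with, for example, mpiexec -n 2 on a TPSL-enabled system, each thread would behave as an independent MPI rank driving its own network communication, whereas a conventional MPI library would report a single rank per process and serialize access to shared library state.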
