Process-in-process: techniques for practical address-space sharing

The two most common parallel execution models for many-core CPUs today are the multiprocess model (e.g., MPI) and the multithreaded model (e.g., OpenMP). The multiprocess model gives each process its own private address space, although processes can explicitly allocate shared-memory regions. The multithreaded model shares the entire address space among threads by default, although threads can explicitly move data to thread-private storage. In this paper, we present a third model called process-in-process (PiP), in which multiple processes are mapped into a single virtual address space. Each process thus still owns its process-private storage (like the multiprocess model) but can directly access the private storage of other processes in the same virtual address space (like the multithreaded model). The idea of sharing an address space among multiple processes is itself not new. What makes PiP unique, however, is that its design lives entirely in user space, making it a portable and practical approach for large supercomputing systems where porting existing OS-based techniques can be difficult. The PiP library is compact and is designed to integrate with other runtime systems such as MPI and OpenMP as portable low-level support for boosting communication performance in HPC applications. We showcase the uniqueness of the PiP environment through both a variety of parallel runtime optimizations and direct use in a data analysis application. We evaluate PiP on several platforms, including two high-ranking supercomputers, and we measure and analyze its performance with a variety of micro- and macro-kernels, a proxy application, and a data analysis application.
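To make the execution model concrete, the C sketch below illustrates the idea the abstract describes: a task publishes a pointer to its process-private buffer, and a sibling task mapped into the same virtual address space dereferences that pointer directly, with no copy and no explicitly created shared-memory segment. The interface used here (pip_spawn_task, pip_export, pip_import) is a hypothetical placeholder modeled on the paper's description, not the PiP library's actual API, and synchronization between export and import is omitted for brevity.

```c
/* Conceptual sketch of the process-in-process (PiP) execution model.
 * The pip_* declarations below are HYPOTHETICAL placeholders modeled on
 * the paper's description; they are not the real PiP library API.
 */
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical: spawn ntasks tasks, each a process mapped into THIS
 * virtual address space, running entry(rank, arg). */
int pip_spawn_task(int ntasks, void (*entry)(int rank, void *arg), void *arg);

/* Hypothetical: publish / look up a task-private address by task rank. */
int pip_export(int rank, void *addr);
int pip_import(int rank, void **addrp);

static void task_body(int rank, void *arg)
{
    int n = *(int *)arg;

    /* Process-private storage: each task has its own copy, as in the
     * multiprocess model (separate data/bss/heap/stack). */
    double *local = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++)
        local[i] = (double)rank;

    pip_export(rank, local);      /* make this private buffer visible */

    if (rank == 0) {
        void *peer;
        pip_import(1, &peer);     /* obtain task 1's buffer address */
        /* Because all tasks live in one virtual address space, rank 0 can
         * dereference rank 1's private buffer directly -- no copy and no
         * explicit shared-memory segment -- as in the multithreaded model.
         * (Export/import synchronization is omitted for brevity.) */
        printf("peer[0] = %f\n", ((double *)peer)[0]);
    }
    free(local);
}

int main(void)
{
    int n = 1024;
    return pip_spawn_task(2, task_body, &n);
}
```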
