Process-in-process: techniques for practical address-space sharing

The two most common parallel execution models for many-core CPUs today are the multiprocess model (e.g., MPI) and the multithreaded model (e.g., OpenMP). The multiprocess model gives each process its own private address space, although processes can explicitly allocate shared-memory regions. The multithreaded model shares the entire address space among threads by default, although threads can explicitly move data to thread-private storage. In this paper, we present a third model called process-in-process (PiP), in which multiple processes are mapped into a single virtual address space. Each process thus still owns its process-private storage (like the multiprocess model) but can directly access the private storage of other processes in the same virtual address space (like the multithreaded model). The idea of sharing an address space among multiple processes is itself not new. What makes PiP unique, however, is that its design lives entirely in user space, making it a portable and practical approach for large supercomputing systems where porting existing OS-based techniques can be difficult. The PiP library is compact and is designed to integrate with other runtime systems such as MPI and OpenMP as portable low-level support for boosting communication performance in HPC applications. We showcase the uniqueness of the PiP environment through both a variety of parallel runtime optimizations and direct use in a data analysis application. We evaluate PiP on several platforms, including two high-ranking supercomputers, and we measure and analyze its performance with a variety of micro- and macro-kernels, a proxy application, and a data analysis application.
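To make the execution model concrete, the C sketch below illustrates the idea the abstract describes: a task publishes a pointer to its process-private buffer, and a sibling task mapped into the same virtual address space dereferences that pointer directly, with no copy and no explicitly created shared-memory segment. The interface used here (pip_spawn_task, pip_export, pip_import) is a hypothetical placeholder modeled on the paper's description, not the PiP library's actual API, and synchronization between export and import is omitted for brevity.

```c
/* Conceptual sketch of the process-in-process (PiP) execution model.
 * The pip_* declarations below are HYPOTHETICAL placeholders modeled on
 * the paper's description; they are not the real PiP library API.
 */
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical: spawn ntasks tasks, each a process mapped into THIS
 * virtual address space, running entry(rank, arg). */
int pip_spawn_task(int ntasks, void (*entry)(int rank, void *arg), void *arg);

/* Hypothetical: publish / look up a task-private address by task rank. */
int pip_export(int rank, void *addr);
int pip_import(int rank, void **addrp);

static void task_body(int rank, void *arg)
{
    int n = *(int *)arg;

    /* Process-private storage: each task has its own copy, as in the
     * multiprocess model (separate data/bss/heap/stack). */
    double *local = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++)
        local[i] = (double)rank;

    pip_export(rank, local);      /* make this private buffer visible */

    if (rank == 0) {
        void *peer;
        pip_import(1, &peer);     /* obtain task 1's buffer address */
        /* Because all tasks live in one virtual address space, rank 0 can
         * dereference rank 1's private buffer directly -- no copy and no
         * explicit shared-memory segment -- as in the multithreaded model.
         * (Export/import synchronization is omitted for brevity.) */
        printf("peer[0] = %f\n", ((double *)peer)[0]);
    }
    free(local);
}

int main(void)
{
    int n = 1024;
    return pip_spawn_task(2, task_body, &n);
}
```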
