Communication-aware process and thread mapping using online communication detection

- We perform online detection of inter-process and inter-thread communication.
- The detected communication pattern is used to migrate processes and threads.
- The mechanism operates at the operating system level, with no changes to applications or runtime libraries.
- We reduce execution time and energy consumption.
- Evaluations on shared memory machines and on a cluster show substantial improvements.

The rising complexity of memory hierarchies and interconnections in parallel shared memory architectures leads to differences in communication performance. These differences can be exploited to perform a communication-aware mapping of parallel applications to the hardware topology, improving their performance and energy efficiency. To perform the mapping, it is necessary to determine the communication behavior of the processes and threads of the application. Previous methods rely on static communication traces to detect communication, require hardware changes, or support only a subset of parallelization models. We propose CDSM, Communication Detection in Shared Memory, a mechanism that detects communication from page faults and uses this information to perform the mapping. CDSM works on the operating system level during the execution of the parallel application and supports all parallelization models that use shared memory for communication. It requires no modifications to the applications, no prior knowledge of their behavior, and no changes to the hardware or runtime libraries. Experiments with the MPI, MPI+OpenMP, and OpenMP implementations of the NAS parallel benchmarks, the HPCC benchmark, and the PARSEC benchmark suite on a shared memory machine show that CDSM has a high detection accuracy with negligible overhead. Execution time and processor energy consumption were reduced by up to 35.9% and 18.9%, respectively (10.2% and 7.3% on average). Experiments on a cluster system, where CDSM optimizes the communication within each node, showed an average execution time reduction of 10.4%.
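As a rough illustration of the approach the abstract describes, the sketch below shows, in user-space C, the two steps involved: building a pairwise communication estimate from page-fault events, and greedily placing the most-communicating pairs on nearby cores. This is a minimal sketch under stated assumptions, not CDSM's implementation: the real mechanism runs inside the operating system's page-fault handler, the fault stream here is simulated, and all identifiers (record_fault, greedy_pair_map, comm_matrix) as well as the policy that cores 2k and 2k+1 share a cache level are hypothetical.

```c
/* Minimal user-space sketch of page-fault-based communication detection
 * and greedy pair mapping. Fault events are simulated; the real CDSM
 * mechanism observes them in the kernel's page-fault handler. */
#include <stdio.h>
#include <string.h>

#define NTHREADS 4
#define NPAGES   8

static int comm_matrix[NTHREADS][NTHREADS]; /* pairwise communication estimate */
static int last_faulter[NPAGES];            /* last thread that faulted on each page */

/* Called once per (simulated) page fault: if a different thread touched
 * this page before, count it as communication between the two threads. */
static void record_fault(int thread, int page)
{
    int prev = last_faulter[page];
    if (prev >= 0 && prev != thread) {
        comm_matrix[thread][prev]++;
        comm_matrix[prev][thread]++;
    }
    last_faulter[page] = thread;
}

/* Greedy mapping: repeatedly pick the most-communicating unmapped pair
 * and place it on adjacent cores (assumption: cores 2k and 2k+1 share
 * a cache level; a real policy would follow the machine topology). */
static void greedy_pair_map(int core_of[NTHREADS])
{
    int mapped[NTHREADS] = {0};
    int next_core = 0;
    for (;;) {
        int best = -1, a = -1, b = -1;
        for (int i = 0; i < NTHREADS; i++)
            for (int j = i + 1; j < NTHREADS; j++)
                if (!mapped[i] && !mapped[j] && comm_matrix[i][j] > best) {
                    best = comm_matrix[i][j]; a = i; b = j;
                }
        if (a < 0)
            break; /* fewer than two unmapped threads remain */
        mapped[a] = mapped[b] = 1;
        core_of[a] = next_core++;
        core_of[b] = next_core++;
    }
}

int main(void)
{
    memset(last_faulter, -1, sizeof(last_faulter));

    /* Simulated fault stream {thread, page}: threads 0/1 share pages 0-1,
     * threads 2/3 share pages 2-3. */
    int trace[][2] = {{0,0},{1,0},{0,1},{1,1},{2,2},
                      {3,2},{2,3},{3,3},{1,0},{0,0}};
    for (size_t i = 0; i < sizeof(trace) / sizeof(trace[0]); i++)
        record_fault(trace[i][0], trace[i][1]);

    int core_of[NTHREADS];
    greedy_pair_map(core_of);
    for (int t = 0; t < NTHREADS; t++)
        printf("thread %d -> core %d\n", t, core_of[t]);
    return 0;
}
```

In a real system, the computed placement would be applied with an affinity call such as sched_setaffinity on Linux, and the communication matrix would be refreshed periodically, since an online mechanism must track changes in the application's communication behavior over time.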
