Hardware-Assisted Thread and Data Mapping in Hierarchical Multicore Architectures

The performance and energy efficiency of modern architectures depend on memory locality, which can be improved by thread and data mappings that take the memory access behavior of parallel applications into account. In this article, we propose intense pages mapping, a mechanism that analyzes memory access behavior using information about how long each page's entry resides in the translation lookaside buffer (TLB). It provides accurate information with very low overhead. Experiments in simulation and on real machines show average performance improvements of 13.7% and energy savings of 4.4%, which come from reductions in cache misses and interconnection traffic.
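To illustrate the general idea of using TLB residency time as a locality signal, the following is a minimal sketch and not the authors' mechanism. It assumes a hypothetical trace of `(node, page, insert_time, evict_time)` tuples, one per TLB entry lifetime, and a simple policy that places each page on the node where its entries stayed resident longest. The function names, the trace format, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import defaultdict


def accumulate_residency(events):
    """Sum TLB residency time per page and per node.

    `events` is a hypothetical trace of (node, page, insert_time, evict_time)
    tuples; the abstract does not specify how this information is collected.
    """
    residency = defaultdict(lambda: defaultdict(int))
    for node, page, t_in, t_out in events:
        residency[page][node] += t_out - t_in
    return residency


def map_pages(residency):
    """Assign each page to the node with the largest total residency.

    Ties go to the lowest node id (an arbitrary choice for this sketch).
    """
    return {
        page: min(per_node, key=lambda n: (-per_node[n], n))
        for page, per_node in residency.items()
    }


# Illustrative trace: two nodes, two pages, times in arbitrary units.
events = [
    (0, 0x1000, 0, 50),
    (1, 0x1000, 10, 20),
    (1, 0x2000, 0, 80),
    (0, 0x2000, 30, 40),
    (1, 0x1000, 60, 65),
]
placement = map_pages(accumulate_residency(events))
```

In this toy trace, page `0x1000` accumulates more residency on node 0 and page `0x2000` more on node 1, so each is placed on the node that kept it in the TLB longest. A real system would also have to decide when to migrate and how to co-locate threads, which this sketch leaves out.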
