Optimizing memory affinity with a hybrid compiler/OS approach

Optimizing the memory access behavior is an important challenge to improve the performance and energy consumption of parallel applications on shared memory architectures. Modern systems contain complex memory hierarchies with multiple memory controllers and several levels of caches. In such machines, analyzing the affinity between threads and data to map them to the hardware hierarchy reduces the cost of memory accesses. In this paper, we introduce a hybrid technique to optimize the memory access behavior of parallel applications. It is based on a compiler optimization that inserts code to predict, at runtime, the memory access behavior of the application and an OS mechanism that uses this information to optimize the mapping of threads and data. In contrast to previous work, our proposal uses a proactive technique to improve the future memory access behavior using predictions instead of the past behavior. Our mechanism achieves substantial performance gains for a variety of parallel applications.

[1]  Philippe Olivier Alexandre Navaux,et al.  Kernel-Based Thread and Data Mapping for Improved Memory Affinity , 2016, IEEE Transactions on Parallel and Distributed Systems.

[2]  Frank Mueller,et al.  Hardware profile-guided automatic page placement for ccNUMA systems , 2006, PPoPP '06.

[3]  Jean-François Méhaut,et al.  Memory Affinity for Hierarchical Shared Memory Multiprocessors , 2009, 2009 21st International Symposium on Computer Architecture and High Performance Computing.

[4]  Frank Mueller,et al.  Feedback-directed page placement for ccNUMA via hardware-generated memory traces , 2010, J. Parallel Distributed Comput..

[5]  Christoph Lameter,et al.  An overview of non-uniform memory access , 2013, CACM.

[6]  Philippe Olivier Alexandre Navaux,et al.  kMAF: Automatic kernel-level management of thread and data affinity , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[7]  Sverker Holmgren,et al.  affinity-on-next-touch: increasing the performance of an industrial PDE solver on a cc-NUMA system , 2005, ICS '05.

[8]  Wei Wang,et al.  Performance analysis of thread mappings with a holistic view of the hardware resources , 2012, 2012 IEEE International Symposium on Performance Analysis of Systems & Software.

[9]  Vivien Quéma,et al.  Traffic management: a holistic approach to memory placement on NUMA systems , 2013, ASPLOS '13.

[10]  Philippe Olivier Alexandre Navaux,et al.  Locality vs. Balance: Exploring Data Mapping Policies on NUMA Systems , 2015, 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing.

[11]  Philippe Olivier Alexandre Navaux,et al.  LAPT: A locality-aware page table for thread and data mapping , 2016, Parallel Comput..

[12]  Thomas R. Gross,et al.  Matching memory access patterns and data placement for NUMA systems , 2012, CGO '12.

[13]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[14]  Philippe Olivier Alexandre Navaux,et al.  An Efficient Algorithm for Communication-Based Task Mapping , 2015, 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing.

[15]  Philippe Olivier Alexandre Navaux,et al.  Characterizing communication and page usage of parallel applications for thread and data mapping , 2015, Perform. Evaluation.

[16]  Philippe Olivier Alexandre Navaux,et al.  Hardware-Assisted Thread and Data Mapping in Hierarchical Multicore Architectures , 2016, ACM Trans. Archit. Code Optim..

[17]  Anoop Gupta,et al.  Operating system support for improving data locality on CC-NUMA compute servers , 1996, ASPLOS VII.

[18]  Jean-François Méhaut,et al.  Improving Memory Affinity of Geophysics Applications on NUMA Platforms Using Minas , 2010, VECPAR.

[19]  Carl Staelin,et al.  lmbench: Portable Tools for Performance Analysis , 1996, USENIX Annual Technical Conference.

[20]  John Shalf,et al.  Exascale Computing Technology Challenges , 2010, VECPAR.

[21]  Ricardo Bianchini,et al.  Using simple page placement policies to reduce the cost of cache fills in coherent shared-memory systems , 1995, Proceedings of 9th International Parallel Processing Symposium.

[22]  David W. Nellans,et al.  Handling the problems and opportunities posed by multiple on-chip memory controllers , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[23]  Fernando Magno Quintão Pereira,et al.  Compiler support for selective page migration in NUMA architectures , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[24]  Jeffrey K. Hollingsworth,et al.  Using Hardware Counters to Automatically Improve Memory Performance , 2004, Proceedings of the ACM/IEEE SC2004 Conference.

[25]  Jeffrey K. Hollingsworth,et al.  Hardware monitors for dynamic page migration , 2008, J. Parallel Distributed Comput..

[26]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[27]  Carla Schlatter Ellis,et al.  An analysis of dynamic page placement on a NUMA multiprocessor , 1992, SIGMETRICS '92/PERFORMANCE '92.

[28]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[29]  Eduard Ayguadé,et al.  UPMLIB: A Runtime System for Tuning the Memory Performance of OpenMP Programs on Scalable Shared-Memory Multiprocessors , 2000, LCR.