OS-Based NUMA Optimization: Tackling the Case of Truly Multi-thread Applications with Non-partitioned Virtual Page Accesses

A common approach to improving memory access on NUMA machines exploits operating system (OS) page protection mechanisms to induce faults that reveal which pages are accessed by which thread, so that a thread and its working set of pages can be moved to the same NUMA node. However, existing proposals do not fully fit the requirements of truly multi-thread applications with non-partitioned accesses to virtual pages. These proposals rely on (induced) faults on a single page table shared by all the threads of a process to determine the access pattern. Hence, a fault by one thread (and the consequent re-opening of access to the corresponding page) masks later accesses by other threads to the same page, which may lead to an inaccurate estimate of the working set of individual threads. We overcome this drawback by presenting a lightweight operating system support for Linux, referred to as multi-view address space, which explicitly targets accurate per-thread working-set estimation in truly multi-thread applications with non-partitioned accesses, together with an associated thread/data migration policy. Our solution is fully transparent to user-space code. It is embedded in a Linux/x86_64 module that installs any required modification to the original kernel image by relying solely on dynamic patching. A motivated case study in the context of HPC is also presented to assess our proposal.
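The masking effect described in the abstract can be illustrated with a toy user-space model. The sketch below is not the kernel module and makes simplifying assumptions: every page starts protected at the beginning of one sampling window, a fault re-opens the page, and the names `estimate_working_sets`, `per_thread_views`, and `trace` are hypothetical. With a single shared view, a re-opened page stays open for every thread; with one view per thread, each thread faults on its own first access.

```python
from collections import defaultdict


def estimate_working_sets(trace, per_thread_views):
    """Estimate per-thread working sets from induced faults (toy model).

    trace: list of (thread_id, page) accesses within one sampling window.
    per_thread_views: if False, all threads share one protection state
    (one page table); if True, each thread has its own protection state.
    """
    opened = set()               # protection entries already re-opened
    observed = defaultdict(set)  # thread_id -> pages seen through faults
    for thread, page in trace:
        # Shared view: protection is keyed by page only.
        # Per-thread view: protection is keyed by (thread, page).
        key = (thread, page) if per_thread_views else page
        if key not in opened:
            observed[thread].add(page)  # fault: record who touched the page
            opened.add(key)             # re-open access for this view
    return dict(observed)


# Two threads with overlapping (non-partitioned) page accesses.
trace = [("T0", 1), ("T1", 1), ("T1", 2), ("T0", 2), ("T0", 3)]

shared = estimate_working_sets(trace, per_thread_views=False)
multi = estimate_working_sets(trace, per_thread_views=True)

print("shared view:     ", shared)
print("per-thread views:", multi)
```

In this trace the shared view attributes page 1 only to T0 and page 2 only to T1, while the per-thread views recover both pages for both threads, which is the accuracy gap the abstract attributes to a single shared page table.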
