Non-uniform Memory Affinity Strategy in Multi-Threaded Sparse Matrix Computations

As core counts on modern multi-processor systems increase, so does memory contention, because all processes and threads try to access main memory simultaneously. This is typical of UMA (Uniform Memory Access) architectures, in which a single physical memory bank leads to poor scalability in multi-threaded applications. To alleviate this problem, modern systems are increasingly moving towards Non-Uniform Memory Access (NUMA) architectures, in which the physical memory is split into several (typically two or four) banks. Each memory bank is associated with a set of cores, so threads can operate from their own physical memory banks while retaining a shared virtual address space. However, accessing shared data structures that reside in remote memory banks can be considerably slower than accessing local ones. This paper proposes a way to determine which parts of the shared data belong on which memory banks and to pin them there, thus minimizing remote accesses. To achieve this, the existing application code has to be supplied with the proposed interface to set up and distribute the shared data appropriately among memory banks. Experiments were performed with the NAS benchmark as well as with a realistic large-scale application that calculates ab-initio nuclear structure. Speedups of up to 3.5 times were observed with the proposed approach compared with the default memory placement policy.
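The idea of deciding which part of a shared sparse matrix belongs on which memory bank can be illustrated with a small sketch. The abstract does not show the proposed interface, so everything below is an assumption made for illustration: the matrix is taken to be in CSR format, the function names `partition_rows_by_nnz` and `value_byte_ranges` are hypothetical, and the split is a simple contiguous row partition balanced by nonzero count. The sketch only computes the placement plan; on Linux the resulting byte ranges would still have to be bound to nodes by a mechanism such as `mbind` or libnuma's `numa_tonode_memory`.

```python
from bisect import bisect_left


def partition_rows_by_nnz(row_ptr, num_nodes):
    """Split CSR rows into num_nodes contiguous blocks with similar nonzero counts.

    row_ptr is the usual CSR row-pointer array (length n_rows + 1).
    Returns a list of (first_row, end_row) pairs, end_row exclusive.
    Hypothetical helper, not the interface proposed in the paper.
    """
    n_rows = len(row_ptr) - 1
    total_nnz = row_ptr[-1]
    bounds = [0]
    for k in range(1, num_nodes):
        target = total_nnz * k / num_nodes
        # First row boundary whose prefix nonzero count reaches the target.
        cut = bisect_left(row_ptr, target, lo=bounds[-1], hi=n_rows)
        bounds.append(cut)
    bounds.append(n_rows)
    return [(bounds[i], bounds[i + 1]) for i in range(num_nodes)]


def value_byte_ranges(row_ptr, blocks, itemsize=8):
    """Byte range of the CSR values array that each block of rows covers.

    itemsize=8 assumes double-precision values (an assumption of this sketch).
    """
    return [(row_ptr[a] * itemsize, row_ptr[b] * itemsize) for a, b in blocks]


if __name__ == "__main__":
    # Four rows with two nonzeros each, split across two memory banks.
    row_ptr = [0, 2, 4, 6, 8]
    blocks = partition_rows_by_nnz(row_ptr, 2)
    print(blocks)
    print(value_byte_ranges(row_ptr, blocks))
```

Threads pinned to the cores of a given bank would then work on the rows of the matching block, so that most accesses to the matrix values stay local. A real placement would also need to round the byte ranges to page boundaries, since memory is bound to NUMA nodes at page granularity.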