Thread and Memory Placement on NUMA Systems: Asymmetry Matters

It is well known that the placement of threads and memory plays a crucial role for performance on NUMA (Non-Uniform Memory-Access) systems. The conventional wisdom is to place threads close to their memory, to collocate on the same node threads that share data, and to segregate on different nodes threads that compete for memory bandwidth or cache resources. While many studies addressed thread and data placement, none of them considered a crucial property of modern NUMA systems that is likely to prevail in the future: asymmetric interconnect. When the nodes are connected by links of different bandwidth, we must consider not only whether the threads and data are placed on the same or different nodes, but how these nodes are connected. We study the effects of asymmetry on a widely available ×86 system and find that performance can vary by more than 2× under the same distribution of thread and data across the nodes but different inter-node connectivity. The key new insight is that the best-performing connectivity is the one with the greatest total bandwidth as opposed to the smallest number of hops. Based on our findings we designed and implemented a dynamic thread and memory placement algorithm in Linux that delivers similar or better performance than the best static placement and up to 218% better performance than when the placement is chosen randomly.

[1]  Ali Kamali SHARING AWARE SCHEDULING ON MULTICORE SYSTEMS , 2010 .

[2]  Francisco J. Cazorla,et al.  Thread to strand binding of parallel network applications in massive multi-threaded systems , 2010, PPoPP '10.

[3]  Vivien Quéma,et al.  Traffic management: a holistic approach to memory placement on NUMA systems , 2013, ASPLOS '13.

[4]  Manuel Prieto,et al.  A comprehensive scheduler for asymmetric multicore systems , 2010, EuroSys '10.

[5]  Vivien Quéma,et al.  MemProf: A Memory Profiler for NUMA Multicore Systems , 2012, USENIX Annual Technical Conference.

[6]  Michael L. Scott,et al.  Simple but effective techniques for NUMA memory management , 1989, SOSP '89.

[7]  Michael Stumm,et al.  Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors , 2007, EuroSys '07.

[8]  Tong Li,et al.  Efficient operating system scheduling for performance-asymmetric multi-core architectures , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[9]  Tim Brecht,et al.  On the importance of parallel application placement in NUMA multiprocessors , 1993 .

[10]  Frank Bellosa,et al.  Resource-conscious scheduling for energy efficiency on multicore processors , 2010, EuroSys '10.

[11]  Francisco J. Cazorla,et al.  Optimal task assignment in multithreaded processors: a statistical approach , 2012, ASPLOS XVII.

[12]  David W. Nellans,et al.  Handling the problems and opportunities posed by multiple on-chip memory controllers , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[13]  Xiao Zhang,et al.  Optimizing Google's warehouse scale computers: The NUMA experience , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[14]  Thomas R. Gross,et al.  Memory management in NUMA multicore systems: trapped between cache contention and interconnect overhead , 2011, ISMM '11.

[15]  David H. Bailey,et al.  The NAS parallel benchmarks summary and preliminary results , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[16]  Alec Wolman,et al.  Volley: Automated Data Placement for Geo-Distributed Cloud Services , 2010, NSDI.

[17]  Anoop Gupta,et al.  Operating system support for improving data locality on CC-NUMA compute servers , 1996, ASPLOS VII.

[18]  Alexandra Fedorova,et al.  A case for NUMA-aware contention management on multicore systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[19]  Alexandra Fedorova,et al.  Addressing shared resource contention in multicore processors via scheduling , 2010, ASPLOS XV.

[20]  Ali Kamali,et al.  AASH: an asymmetry-aware scheduler for hypervisors , 2010, VEE '10.

[21]  Chris Johnson,et al.  Data Distribution , Migration and Replication on a cc-NUMA Architecture , 2002 .

[22]  Xavier Lorca,et al.  Entropy: a consolidation manager for clusters , 2009, VEE '09.

[23]  Carla Schlatter Ellis,et al.  Evaluation of NUMA Memory Management Through Modeling and Measurements , 1992, IEEE Trans. Parallel Distributed Syst..

[24]  Dheeraj Reddy,et al.  Bias scheduling in heterogeneous multi-core architectures , 2010, EuroSys '10.

[25]  Laxmi N. Bhuyan,et al.  ADAPT: A framework for coscheduling multithreaded programs , 2013, TACO.

[26]  Tong Li,et al.  Using OS Observations to Improve Performance in Multicore Systems , 2008, IEEE Micro.

[27]  Thomas R. Gross,et al.  Memory system performance in a NUMA multicore multiprocessor , 2011, SYSTOR '11.