Traffic management: a holistic approach to memory placement on NUMA systems

NUMA (Non-Uniform Memory Access) systems are characterized by non-uniform memory access times: accessing data on a remote node takes longer than a local access. NUMA hardware has been built since the late 1980s, and the operating systems designed for it were optimized for access locality: they co-located memory pages with the threads that accessed them, so as to avoid the cost of remote accesses. In contrast to older systems, modern NUMA hardware has much smaller remote wire delays, so remote access cost per se is no longer the main performance concern, as we discovered in this work. Instead, congestion on memory controllers and interconnects, caused by memory traffic from data-intensive applications, hurts performance far more. Memory placement algorithms must therefore be redesigned to target traffic congestion, which requires an arsenal of techniques that go beyond optimizing locality. In this paper we describe Carrefour, an algorithm that addresses this goal. We implemented Carrefour in Linux and obtained performance improvements of up to 3.6× relative to the default kernel, as well as significant improvements over NUMA-aware patchsets available for Linux. When memory placement cannot be improved, Carrefour never hurts performance by more than 4%. We present the design of Carrefour, discuss the challenges of implementing it on modern hardware, and draw insights about hardware support that would help optimize system software on future NUMA systems.
