Modeling memory system performance of NUMA multicore-multiprocessors

The performance of many applications depends closely on the way they interact with the computer’s memory system: many applications obtain good performance only if they utilize the memory system efficiently. Unfortunately, obtaining good memory system performance is often difficult, as developing memory-system-aware (system) software requires a thorough and detailed understanding of both the characteristics of the memory system and the interaction of applications with it. Moreover, the design of memory systems evolves as newer processor generations appear on the market; thus, the problem of software–hardware interaction must be revisited to understand how existing software interacts with newer memory system designs. This thesis investigates the memory system performance of a recent class of machines, multicore-multiprocessors with a non-uniform memory architecture (NUMA). A NUMA multicore-multiprocessor system consists of several processors, each of which integrates multiple cores. Typically, the cores of a multicore processor share resources (e.g., last-level caches), and contention for these shared resources can result in significant performance degradation. NUMA multicore-multiprocessors are shared-memory computers, but the memory space of a NUMA multicore-multiprocessor system is partitioned between processors. Accessing the memory of the local processor takes less time than accessing the memory of other (remote) processors; therefore, data locality (a low number of remote memory accesses) is critical for good performance on NUMA machines. This thesis presents a performance-oriented model for NUMA multicore-multiprocessors. The model considers two application classes: multiprogrammed workloads (workloads that consist of multiple, independent processes) and multithreaded programs (programs that consist of a number of threads that operate in a shared address space).
The thesis presents an experimental analysis of the memory system bottlenecks experienced by each application class, together with techniques to reduce the performance-degrading effects of these bottlenecks. We determine (based on experimental analysis) that the performance of multiprogrammed workloads depends on both multicore-specific and NUMA-specific aspects of a NUMA multicore-multiprocessor’s memory system. Therefore, a process scheduler must find a balance between reducing cache contention and improving data locality; the N-MASS scheduler presented in this thesis attempts to strike a balance between these sometimes conflicting goals. N-MASS improves performance by up to 32% over the default setup in current Linux implementations on a recent 2-processor 8-core machine. Also based on experimental analysis, we find that data locality is of prime importance for the performance of multithreaded programs. The thesis presents extensions to two popular parallel programming frameworks, OpenMP and Intel’s Threading Building Blocks. The extensions
