Predicting the memory bandwidth and optimal core allocations for multi-threaded applications on large-scale NUMA machines

Modern NUMA platforms offer large numbers of cores to boost performance through parallelism and multi-threading. However, because performance scalability is limited by the available memory bandwidth, allocating all cores can degrade performance. Accurately predicting the optimal (best-performing) core allocation, and executing applications with that allocation, is therefore crucial for achieving the best performance. Previous research focused on predicting the optimal number of cores. In this paper, we show that, because of asymmetric NUMA memory configurations and asymmetric application memory behavior, an optimal core allocation is not merely an optimal number of cores. Previous studies also do not adequately consider NUMA memory resources, which further limits their ability to predict optimal core allocations accurately. We present a model, NuCore, that predicts both memory bandwidth usage and optimal core allocations. NuCore considers various memory resources and NUMA asymmetry, and employs Integer Programming to achieve high accuracy and low overhead. Experimental results from real NUMA machines show that the core allocations predicted by NuCore provide a 1.27x average speedup over using all cores while allocating only 75.6% of the cores. NuCore also provides 1.18x and 1.21x average speedups over two state-of-the-art techniques. Our results also show that NuCore faithfully models NUMA memory systems and predicts memory bandwidth usage with only 10% average error.
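To make concrete the claim that a core allocation is more than a core count, the following is a minimal sketch under loudly stated assumptions. It is not NuCore's model. The two-node machine, the bandwidth and demand figures, the barrier-synchronized throughput formula, and the names `predicted_throughput` and `best_allocation` are all hypothetical, and exhaustive enumeration stands in for an Integer Programming solver.

```python
from itertools import product

def predicted_throughput(alloc, bandwidth, demand):
    """Toy model (an assumption, not NuCore's): threads synchronize, so every
    thread runs at the speed of the slowest one. A thread on node i runs at
    full speed until that node's memory bandwidth saturates."""
    total_threads = sum(alloc)
    if total_threads == 0:
        return 0.0
    slowest = 1.0
    for n, bw, d in zip(alloc, bandwidth, demand):
        if n > 0:
            # n threads each demanding d GB/s share bw GB/s on this node.
            slowest = min(slowest, bw / (n * d))
    return total_threads * slowest

def best_allocation(cores, bandwidth, demand):
    """Enumerate every per-node core count and keep the best-scoring one."""
    candidates = product(*(range(c + 1) for c in cores))
    return max(candidates, key=lambda a: predicted_throughput(a, bandwidth, demand))

# Hypothetical machine: two NUMA nodes with 8 cores each but unequal
# memory bandwidth (GB/s); each thread demands 5 GB/s.
cores = (8, 8)
bandwidth = (40.0, 10.0)
demand = (5.0, 5.0)

best = best_allocation(cores, bandwidth, demand)
all_cores_score = predicted_throughput(cores, bandwidth, demand)
same_count_score = predicted_throughput((5, 5), bandwidth, demand)
best_score = predicted_throughput(best, bandwidth, demand)
```

In this toy setting the search selects 8 cores on the first node and 2 on the second. An even 5/5 split uses the same 10 cores but scores no better than allocating all 16, which is the sense in which the placement of cores matters and not only their number.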
