Predicting the memory bandwidth and optimal core allocations for multi-threaded applications on large-scale NUMA machines
[1] Isaac D. Scherson,et al. NUMA-Aware Multicore Matrix Multiplication , 2014, Parallel Process. Lett..
[2] Hiroshi Nakamura,et al. Scalability-based manycore partitioning , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[3] Vivien Quéma,et al. Thread and Memory Placement on NUMA Systems: Asymmetry Matters , 2015, USENIX Annual Technical Conference.
[4] Thomas R. Gross,et al. (Mis)understanding the NUMA memory system performance of multithreaded workloads , 2013, 2013 IEEE International Symposium on Workload Characterization (IISWC).
[5] Modeling memory system performance of NUMA multicore-multiprocessors , 2014 .
[6] Jean-François Méhaut,et al. MAi: Memory Affinity interface , 2008 .
[7] David W. Nellans,et al. Handling the problems and opportunities posed by multiple on-chip memory controllers , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[8] Jian Li,et al. Dynamic power-performance adaptation of parallel computation on chip multiprocessors , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006.
[9] Lizhong Chen,et al. An Analytical Performance Model for Partitioning Off-Chip Memory Bandwidth , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[10] Chen Ding,et al. Cache Conscious Task Regrouping on Multicore Processors , 2012, 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012).
[11] Rob J Hyndman,et al. Another look at measures of forecast accuracy , 2006 .
[12] Gabriel H. Loh,et al. Dynamic Classification of Program Memory Behaviors in CMPs , 2008 .
[13] John M. Mellor-Crummey,et al. A tool to analyze the performance of multithreaded programs on NUMA architectures , 2014, PPoPP '14.
[14] Michael Stumm,et al. RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations , 2009, ASPLOS.
[15] Michael Frumkin,et al. The OpenMP Implementation of NAS Parallel Benchmarks and its Performance , 2013 .
[16] Ching-Yung Lin,et al. Cache-conscious graph collaborative filtering on multi-socket multicore systems , 2014, Conf. Computing Frontiers.
[17] Enrique Castillo. Extreme value theory in engineering , 1988 .
[18] Robert Tappan Morris,et al. An Analysis of Linux Scalability to Many Cores , 2010, OSDI.
[19] David Eklov,et al. Fast modeling of shared caches in multicore systems , 2011, HiPEAC.
[20] Lieven Eeckhout,et al. Undersubscribed threading on clustered cache architectures , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[21] Vivien Quéma,et al. Traffic management: a holistic approach to memory placement on NUMA systems , 2013, ASPLOS '13.
[22] Tobias Achterberg,et al. SCIP: solving constraint integer programs , 2009, Math. Program. Comput..
[23] David Eklov,et al. Bandwidth Bandit: Quantitative characterization of memory contention , 2012, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[24] Thomas R. Gross,et al. Memory system performance in a NUMA multicore multiprocessor , 2011, SYSTOR '11.
[25] Yale N. Patt,et al. Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs , 2008, ASPLOS.
[26] Mahmut T. Kandemir,et al. Neither more nor less: Optimizing thread-level parallelism for GPGPUs , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[27] Pranith Kumar,et al. Predicting Potential Speedup of Serial Code via Lightweight Profiling and Emulations with Memory Performance Model , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[28] Christian Bienia,et al. Benchmarking modern multiprocessors , 2011 .
[29] Babak Falsafi,et al. BuMP: Bulk Memory Access Prediction and Streaming , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[30] Scott A. Mahlke,et al. Orchestrating the execution of stream programs on multicore platforms , 2008, PLDI '08.
[31] Onur Mutlu,et al. Self-Optimizing Memory Controllers: A Reinforcement Learning Approach , 2008, 2008 International Symposium on Computer Architecture.
[32] Shirley Moore,et al. Non-determinism and overcount on modern hardware performance counter implementations , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[33] Margaret Martonosi,et al. MRPB: Memory request prioritization for massively parallel processors , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[34] Gabor Pataki,et al. Basis reduction and the complexity of branch-and-bound , 2009, SODA '10.
[35] J. Demmel,et al. Sun Microsystems , 1996 .
[36] Jonathan Harris,et al. Extending OpenMP For NUMA Machines , 2000, ACM/IEEE SC 2000 Conference (SC'00).
[37] Zhen Fang,et al. Active memory controller , 2012, The Journal of Supercomputing.
[38] Hendrik W. Lenstra,et al. Integer Programming with a Fixed Number of Variables , 1983, Math. Oper. Res..
[39] Xipeng Shen,et al. Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? , 2010, PPoPP '10.
[40] J. Teugels,et al. Statistics of Extremes , 2004 .
[41] Alexandra Fedorova,et al. A case for NUMA-aware contention management on multicore systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[42] Scott A. Mahlke,et al. When less is more (LIMO): controlled parallelism for improved efficiency , 2012, CASES '12.
[43] Gurindar S. Sohi,et al. Adaptive, efficient, parallel execution of parallel programs , 2014, PLDI.
[44] Carl Staelin,et al. lmbench: Portable Tools for Performance Analysis , 1996, USENIX Annual Technical Conference.
[45] Gurindar S. Sohi,et al. Adaptive, efficient, parallel execution of parallel programs , 2014 .
[46] Mary Lou Soffa,et al. DraMon: Predicting memory bandwidth usage of multi-threaded programs with high accuracy and low overhead , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[47] Babak Falsafi,et al. Reactive NUCA: near-optimal block placement and replication in distributed caches , 2009, ISCA '09.
[48] Jinkyu Jeong,et al. A fully associative, tagless DRAM cache , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[49] Carl Staelin,et al. Memory hierarchy performance measurement of commercial dual-core desktop processors , 2008, J. Syst. Archit..
[50] Nathan Clark,et al. Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications , 2010, ISCA.
[51] Karthikeyan Sankaralingam,et al. A general constraint-centric scheduling framework for spatial architectures , 2013, PLDI.
[52] Laxmi N. Bhuyan,et al. Thread reinforcer: Dynamically determining number of threads via OS level monitoring , 2011, 2011 IEEE International Symposium on Workload Characterization (IISWC).
[53] Bruce R. Childers,et al. Inflation and deflation of self-adaptive applications , 2011, SEAMS '11.
[54] Francisco J. Cazorla,et al. Optimal task assignment in multithreaded processors: a statistical approach , 2012, ASPLOS XVII.
[55] Harvey M. Wagner,et al. An integer linear‐programming model for machine scheduling , 1959 .