Modeling memory system performance of NUMA multicore-multiprocessors
[1] Stéphane Eranian. What can performance counters do for memory subsystem analysis?, 2008, MSPC '08.
[2] Wolfgang E. Nagel,et al. Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[3] Ramesh Illikkal,et al. Rate-based QoS techniques for cache/memory in CMP platforms , 2009, ICS.
[4] Alexandra Fedorova,et al. A case for NUMA-aware contention management on multicore systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[5] Ian Pratt,et al. Multiprogramming Performance of the Pentium 4 with Hyper-Threading, 2004.
[6] Lingjia Tang,et al. Directly characterizing cross core interference through contention synthesis , 2011, HiPEAC.
[7] Jie Chen,et al. Analysis and approximation of optimal co-scheduling on Chip Multiprocessors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[8] Tarek A. El-Ghazawi,et al. An evaluation of global address space languages: co-array fortran and unified parallel C , 2005, PPoPP.
[9] Margo I. Seltzer,et al. Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design , 2005, USENIX Annual Technical Conference, General Track.
[10] Vivien Quéma,et al. MemProf: A Memory Profiler for NUMA Multicore Systems , 2012, USENIX Annual Technical Conference.
[11] Stijn Eyerman,et al. Probabilistic job symbiosis modeling for SMT processor scheduling , 2010, ASPLOS XV.
[12] Yale N. Patt,et al. Feedback-directed pipeline parallelism , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[13] Michael Stumm,et al. Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors , 2007, EuroSys '07.
[14] Xiao Zhang,et al. Optimizing Google's warehouse scale computers: The NUMA experience , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).
[15] Scott A. Mahlke,et al. Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.
[16] Michael I. Gordon,et al. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs , 2006, ASPLOS XII.
[17] Carole-Jean Wu,et al. Characterization and dynamic mitigation of intra-application cache interference , 2011, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[18] Dean M. Tullsen,et al. Inter-core prefetching for multicore processors using migrating helper threads , 2011, ASPLOS XVI.
[19] Frank Mueller,et al. Feedback-directed page placement for ccNUMA via hardware-generated memory traces , 2010, J. Parallel Distributed Comput..
[20] Uday Bondhugula,et al. Effective automatic computation placement and data allocation for parallelization of regular programs , 2014, ICS '14.
[21] Zachary R. Anderson. Efficiently combining parallel software using fine-grained, language-level, hierarchical resource management policies , 2012, OOPSLA '12.
[22] Robert Tappan Morris,et al. Locating cache performance bottlenecks using data profiling , 2010, EuroSys '10.
[23] Kirk W. Cameron,et al. Critical path-based thread placement for NUMA systems , 2011, PMBS '11.
[24] Mahmut T. Kandemir,et al. Cache topology aware computation mapping for multicores , 2010, PLDI '10.
[25] Manuel Prieto,et al. A comprehensive scheduler for asymmetric multicore systems , 2010, EuroSys '10.
[26] Mahmut T. Kandemir,et al. Adaptive set pinning: managing shared caches in chip multiprocessors , 2008, ASPLOS.
[27] Eduard Ayguadé,et al. A case for user-level dynamic page migration , 2000, ICS '00.
[28] Eduard Ayguadé,et al. Is Data Distribution Necessary in OpenMP? , 2000, ACM/IEEE SC 2000 Conference (SC'00).
[29] Michael L. Scott,et al. Simple but effective techniques for NUMA memory management , 1989, SOSP '89.
[30] Dheeraj Reddy,et al. Bias scheduling in heterogeneous multi-core architectures , 2010, EuroSys '10.
[31] Jonathan Harris,et al. Extending OpenMP For NUMA Machines , 2000, ACM/IEEE SC 2000 Conference (SC'00).
[32] Mary Lou Soffa,et al. Contention aware execution: online contention detection and response , 2010, CGO '10.
[33] Thomas R. Gross,et al. Memory management in NUMA multicore systems: trapped between cache contention and interconnect overhead , 2011, ISMM '11.
[34] Christoph Lameter,et al. Local and Remote Memory: Memory in a Linux/NUMA System, 2006.
[35] Robert J. Fowler,et al. Modeling memory concurrency for multi-socket multi-core systems , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).
[36] Quan Chen,et al. CATS: cache aware task-stealing based on online profiling in multi-socket multi-core architectures , 2012, ICS '12.
[37] Matthias Hauswirth,et al. Producing wrong data without doing anything obviously wrong! , 2009, ASPLOS.
[38] Alexandra Fedorova,et al. Contention-Aware Scheduling on Multicore Systems , 2010, TOCS.
[39] Michael D. Smith,et al. Improving Performance Isolation on Chip Multiprocessors via an Operating System Scheduler , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).
[40] Vivek Sarkar,et al. X10: an object-oriented approach to non-uniform cluster computing , 2005, OOPSLA '05.
[41] Wei Wang,et al. ReSense: Mapping dynamic workloads of colocated multithreaded applications using resource sensitivity , 2013, ACM Trans. Archit. Code Optim..
[42] Jeffrey K. Hollingsworth,et al. NUMA-aware Java heaps for server applications , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.
[43] Matteo Frigo,et al. The implementation of the Cilk-5 multithreaded language , 1998, PLDI.
[44] Pradeep Dubey,et al. Can traditional programming bridge the Ninja performance gap for parallel computing applications? , 2015, 2012 39th Annual International Symposium on Computer Architecture (ISCA).
[45] Samuel Williams,et al. Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.
[46] A. Snavely,et al. Symbiotic jobscheduling for a simultaneous multithreading processor , 2000, SIGP.
[47] Xipeng Shen,et al. Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? , 2010, PPoPP '10.
[48] Kenneth M. Wilson,et al. Dynamic page placement to improve locality in CC-NUMA multiprocessors for TPC-C , 2001, SC.
[49] David A. Padua,et al. Programming for parallelism and locality with hierarchically tiled arrays , 2006, PPoPP '06.
[50] Nir Shavit,et al. Flat-combining NUMA locks , 2011, SPAA '11.
[51] Rafael Asenjo,et al. Analytical Modeling of Pipeline Parallelism , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.
[52] Jeffrey K. Hollingsworth,et al. Hardware monitors for dynamic page migration , 2008, J. Parallel Distributed Comput..
[53] Kai Li,et al. The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[54] James Charles,et al. Evaluation of the Intel® Core™ i7 Turbo Boost feature , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[55] Frank Mueller,et al. Hardware profile-guided automatic page placement for ccNUMA systems , 2006, PPoPP '06.
[56] Michael Frumkin,et al. The OpenMP Implementation of NAS Parallel Benchmarks and its Performance, 2013.
[57] Matthias S. Müller,et al. Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.
[58] Robert J. Fowler,et al. Generalized multipartitioning of multi-dimensional arrays for parallelizing line-sweep computations , 2003, J. Parallel Distributed Comput..
[59] Idit Keidar,et al. SALSA: scalable and low synchronization NUMA-aware algorithm for producer-consumer pools , 2012, SPAA '12.
[60] Benjamin Hindman,et al. Composing parallel software efficiently with lithe , 2010, PLDI '10.
[61] Christian Bienia,et al. Benchmarking modern multiprocessors, 2011.
[62] David Eklov,et al. Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[63] Dean M. Tullsen,et al. Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.
[64] Stijn Eyerman,et al. System-Level Performance Metrics for Multiprogram Workloads , 2008, IEEE Micro.
[65] Zhe Wang,et al. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search , 2007, VLDB.
[66] Mohammad Banikazemi,et al. PAM: A novel performance/power aware meta-scheduler for multi-core systems , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[67] Tarek S. Abdelrahman,et al. Automatic partitioning of data and computations on scalable shared memory multiprocessors , 1997, Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162).
[68] Erez Petrank,et al. Wait-free queues with multiple enqueuers and dequeuers , 2011, PPoPP '11.
[69] John M. Mellor-Crummey,et al. A tool to analyze the performance of multithreaded programs on NUMA architectures , 2014, PPoPP '14.
[70] Michael Stumm,et al. RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations , 2009, ASPLOS.
[71] Lingjia Tang,et al. Contentiousness vs. sensitivity: improving contention aware runtime systems on multicore architectures , 2011, EXADAPT '11.
[72] John M. Mellor-Crummey,et al. Pinpointing data locality problems using data-centric analysis , 2011, International Symposium on Code Generation and Optimization (CGO 2011).
[73] Tong Li,et al. Efficient operating system scheduling for performance-asymmetric multi-core architectures , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).
[74] Guy E. Blelloch,et al. The Data Locality of Work Stealing , 2002, SPAA '00.
[75] Yale N. Patt,et al. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).
[76] Keshav Pingali,et al. Access normalization: loop restructuring for NUMA compilers , 1992, ASPLOS V.
[77] Anoop Gupta,et al. Operating system support for improving data locality on CC-NUMA compute servers , 1996, ASPLOS VII.
[78] Yan Solihin,et al. Predicting inter-thread cache contention on a chip multi-processor architecture , 2005, 11th International Symposium on High-Performance Computer Architecture.
[79] Vana Kalogeraki,et al. FACT: a framework for adaptive contention-aware thread migrations , 2011, CF '11.
[80] T. Gross,et al. Asymmetries in Multi-Core Systems – Or Why We Need Better Performance Measurement Units, 2010.
[81] Alexandra Fedorova,et al. Deconstructing the overhead in parallel applications , 2012, 2012 IEEE International Symposium on Workload Characterization (IISWC).
[82] Aleksandar Milenkovic,et al. Demystifying Intel Branch Predictors, 2005.
[83] Alexandra Fedorova,et al. Addressing shared resource contention in multicore processors via scheduling , 2010, ASPLOS XV.
[84] Rui Yang,et al. Memory and Thread Placement Effects as a Function of Cache Usage: A Study of the Gaussian Chemistry Code on the SunFire X4600 M2 , 2008, 2008 International Symposium on Parallel Architectures, Algorithms, and Networks (i-span 2008).
[85] Yehuda Afek,et al. Fast concurrent queues for x86 processors , 2013, PPoPP '13.
[86] Vivien Quéma,et al. Traffic management: a holistic approach to memory placement on NUMA systems , 2013, ASPLOS '13.
[87] Nenad Nedeljkovic,et al. Data distribution support on distributed shared memory multiprocessors , 1997, PLDI '97.
[88] David W. Nellans,et al. Handling the problems and opportunities posed by multiple on-chip memory controllers , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[89] Jeffrey K. Hollingsworth,et al. Using Hardware Counters to Automatically Improve Memory Performance , 2004, Proceedings of the ACM/IEEE SC2004 Conference.
[90] Christoforos E. Kozyrakis,et al. Dynamic Fine-Grain Scheduling of Pipeline Parallelism , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.
[91] Hui Li,et al. Locality and Loop Scheduling on NUMA Multiprocessors , 1993, 1993 International Conference on Parallel Processing - ICPP'93.
[92] Susan J. Eggers,et al. Impact of sharing-based thread placement on multithreaded architectures , 1994, ISCA '94.
[93] Michael Voss,et al. Optimization via Reflection on Work Stealing in TBB , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[94] Yi Guo,et al. SLAW: A scalable locality-aware adaptive work-stealing scheduler , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[95] Michael Stumm,et al. Online performance analysis by statistical sampling of microprocessor performance counters , 2005, ICS '05.
[96] Takeshi Ogasawara. NUMA-aware memory manager with dominant-thread-based copying GC , 2009, OOPSLA 2009.
[97] Collin McCurdy,et al. Memphis: Finding and fixing NUMA-related performance problems on multi-core platforms , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).
[98] Tong Li,et al. Using OS Observations to Improve Performance in Multicore Systems , 2008, IEEE Micro.