Paper information - Design and analysis of spatially-partitioned shared caches


Data movement is a growing problem in modern chip-multiprocessors (CMPs). Processors spend the majority of their time, energy, and area moving data, not processing it. For example, a single main memory access takes hundreds of cycles and costs the energy of a thousand floating-point operations. Data movement consumes more than half the energy in current processors, and CMPs devote more than half their area to on-chip caches. Moreover, these costs are increasing as CMPs scale to larger core counts.

Processors rely on the on-chip caches to limit data movement, but CMP cache design is challenging. For efficiency reasons, most cache capacity is shared among cores and distributed in banks throughout the chip. Distribution makes cores sensitive to data placement, since some cache banks can be accessed at lower latency and lower energy than others. Yet because applications require sufficient capacity to fit their working sets, it is not enough to just use the closest cache banks. Meanwhile, cores compete for scarce capacity, and the resulting interference, left unchecked, produces many unnecessary cache misses.

This thesis presents novel architectural techniques that navigate these complex tradeoffs and reduce data movement. First, virtual caches spatially partition the shared cache banks to fit applications' working sets near where they are used. Virtual caches expose the distributed banks to software, and let the operating system schedule threads and their working sets to minimize data movement. Second, analytical replacement policies make better use of scarce cache capacity, reducing expensive main memory accesses: Talus eliminates performance cliffs by guaranteeing convex performance, and EVA uses planning theory to derive the optimal replacement metric under uncertainty. These policies improve performance and make qualitative contributions: Talus is cheap to predict, and so lets cache partitioning techniques (including virtual caches) work with high-performance cache replacement; and EVA shows that the conventional approach to practical cache replacement is sub-optimal.

Designing CMP caches is difficult because architects face many options with many interacting factors. Unlike most prior caching work that employs best-effort heuristics, we reason about the tradeoffs through analytical models. This analytical approach lets us achieve the performance and efficiency of application-specific designs across a broad range of applications, while further providing a coherent theoretical framework to reason about data movement. Compared to a 64-core CMP with a conventional cache design, these techniques improve end-to-end performance by up to 76% and an average of 46%, save 36% of system energy, and reduce …
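As an illustration of what "convex performance" means for a cache, the following sketch computes the lower convex hull of a miss curve (misses as a function of cache size). It is not the Talus mechanism itself, only the geometric target that a cliff-free cache would follow. The function names and the miss-curve values are hypothetical, chosen so the curve has a visible cliff between 2 and 3 size units.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (cache size, misses); units are arbitrary


def _cross(o: Point, a: Point, b: Point) -> float:
    """Cross product of vectors o->a and o->b (positive means a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_convex_hull(curve: List[Point]) -> List[Point]:
    """Return the lower convex hull of a miss curve, ordered by cache size."""
    hull: List[Point] = []
    for p in sorted(curve):
        # Drop the last hull point while it lies on or above the segment
        # joining its predecessor to the new point.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


def hull_misses(hull: List[Point], size: float) -> float:
    """Linearly interpolate the hull at a given cache size."""
    for (x0, y0), (x1, y1) in zip(hull, hull[1:]):
        if x0 <= size <= x1:
            t = (size - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("size outside the measured curve")


# Hypothetical miss curve with a cliff between sizes 2 and 3.
curve: List[Point] = [(0, 100), (1, 95), (2, 90), (3, 30), (4, 25)]
hull = lower_convex_hull(curve)
```

On this made-up curve the hull skips the plateau at sizes 1 and 2, so the interpolated miss count at size 2 is well below the original 90. That gap is the benefit a convex (cliff-free) cache would offer at intermediate sizes.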

Nathan Beckmann
