Symmetry-Agnostic Coordinated Management of the Memory Hierarchy in Multicore Systems

In a multicore system, many applications share the last-level cache (LLC) and memory bandwidth. These resources must be managed in a coordinated way to maximize performance. DRAM is still the technology of choice in most systems. However, as traditional DRAM technology faces energy, reliability, and scalability challenges, nonvolatile memory (NVM) technologies are gaining traction. While DRAM is read/write symmetric (a read operation has latency and energy consumption comparable to those of a write operation), many NVM technologies, such as phase-change memory (PCM), exhibit read/write asymmetry: write operations are typically much slower and more power hungry than read operations. Whether the memory is symmetric or asymmetric influences how shared resources should be managed. We propose two symmetry-agnostic schemes that manage a shared LLC through way partitioning and memory through bandwidth allocation. Both schemes work well for symmetric and asymmetric memory. The first is an exhaustive search that finds the best combination of cache way partition and bandwidth allocation. The second is an approximate scheme, derived from a theoretical model, that avoids the overhead of exhaustive search. Simulation results show that the approximate scheme improves weighted speedup by at least 14% on average (regardless of memory symmetry) over state-of-the-art way partitioning and memory bandwidth allocation. Simulation results also show that the approximate scheme achieves weighted speedup comparable to that of a state-of-the-art multiple-resource management scheme, XChange, for symmetric memory, and outperforms it by an average of 10% for asymmetric memory.
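
To make the exhaustive-search idea concrete, the following is a minimal Python sketch, not the paper's implementation. It enumerates every split of cache ways and bandwidth units across applications and keeps the pair with the highest weighted speedup, using the common definition (the sum over applications of shared IPC divided by stand-alone IPC). The helper names (`compositions`, `exhaustive_search`), the discretization of bandwidth into units, the one-unit minimum per application, and the toy performance model `toy_predict_ipc` are all assumptions made for illustration; the abstract does not specify the actual model or search procedure.

```python
from typing import Callable, Iterator, Sequence, Tuple

Allocation = Tuple[int, ...]


def compositions(total: int, parts: int, minimum: int = 1) -> Iterator[Allocation]:
    """Yield every split of `total` units among `parts` apps, each >= `minimum`."""
    if parts == 1:
        if total >= minimum:
            yield (total,)
        return
    # Leave at least `minimum` units for each of the remaining apps.
    for first in range(minimum, total - minimum * (parts - 1) + 1):
        for rest in compositions(total - first, parts - 1, minimum):
            yield (first,) + rest


def weighted_speedup(shared_ipc: Sequence[float], alone_ipc: Sequence[float]) -> float:
    """Sum over apps of IPC when sharing divided by IPC when running alone."""
    return sum(s / a for s, a in zip(shared_ipc, alone_ipc))


def exhaustive_search(
    num_ways: int,
    bw_units: int,
    alone_ipc: Sequence[float],
    predict_ipc: Callable[[int, int, int], float],
) -> Tuple[float, Allocation, Allocation]:
    """Try every (way partition, bandwidth allocation) pair; return the best.

    `predict_ipc(app, ways, bw)` is a hypothetical per-app performance model.
    """
    n = len(alone_ipc)
    best = None
    for ways in compositions(num_ways, n):
        for bw in compositions(bw_units, n):
            shared = [predict_ipc(i, ways[i], bw[i]) for i in range(n)]
            score = weighted_speedup(shared, alone_ipc)
            if best is None or score > best[0]:
                best = (score, ways, bw)
    return best


def toy_predict_ipc(app: int, ways: int, bw: int) -> float:
    """Toy model (assumption): app 0 is cache-hungry, app 1 is bandwidth-hungry."""
    cache_need = (3, 1)[app]
    bw_need = (1, 3)[app]
    return min(1.0, ways / cache_need) * min(1.0, bw / bw_need)


best_score, best_ways, best_bw = exhaustive_search(
    num_ways=4, bw_units=4, alone_ipc=(1.0, 1.0), predict_ipc=toy_predict_ipc
)
print(best_score, best_ways, best_bw)
```

The number of candidate pairs grows combinatorially with the number of applications, ways, and bandwidth units, which illustrates the search overhead that an approximate, model-derived scheme would avoid.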
