Design guidelines for high-performance SCM hierarchies

With emerging storage-class memory (SCM) nearing commercialization, there is evidence that it will deliver the much-anticipated high density and access latencies within only a few factors of DRAM. Nevertheless, the latency-sensitive nature of memory-resident services makes seamless integration of SCM in servers questionable. In this paper, we ask the question of how best to introduce SCM for such servers to improve overall performance/cost over existing DRAM-only architectures. We first show that even with the most optimistic latency projections for SCM, the higher memory access latency results in prohibitive performance degradation. However, we find that deployment of a modestly sized high-bandwidth 3D stacked DRAM cache makes the performance of an SCM-mostly memory system competitive. The high degree of spatial locality that memory-resident services exhibit not only simplifies the DRAM cache's design as page-based, but also enables the amortization of increased SCM access latencies and the mitigation of SCM's read/write latency disparity. We identify the set of memory hierarchy design parameters that plays a key role in the performance and cost of a memory system combining an SCM technology and a 3D stacked DRAM cache. We then introduce a methodology to drive provisioning for each of these design parameters under a target performance/cost goal. Finally, we use our methodology to derive concrete results for specific SCM technologies. With PCM as a case study, we show that a two bits/cell technology hits the performance/cost sweet spot, reducing the memory subsystem cost by 40% while keeping performance within 3% of the best performing DRAM-only system, whereas single-level and triple-level cell organizations are impractical for use as memory replacements.

[1]  Jian Yang,et al.  Mojim: A Reliable and Highly-Available Non-Volatile Memory System , 2015, ASPLOS.

[2]  Dmitri B. Strukov,et al.  Mellow Writes: Extending Lifetime in Resistive Memories through Selective Slow Write Backs , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[3]  Bruce Jacob,et al.  DRAMSim2: A Cycle Accurate Memory System Simulator , 2011, IEEE Computer Architecture Letters.

[4]  David A. Patterson,et al.  In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[5]  Aamer Jaleel,et al.  CAMEO: A Two-Level Memory Organization with Capacity of Main Memory and Flexibility of Hardware-Managed Cache , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[6]  Lizy Kurian John,et al.  The virtual write queue: coordinating DRAM and last-level cache policies , 2010, ISCA.

[7]  Amirsaman Memaripour,et al.  HNVM : Hybrid NVM Enabled Datacenter Design and Optimization Yanqi , 2017 .

[8]  Moinuddin K. Qureshi,et al.  DICE: Compressing DRAM caches for bandwidth and capacity , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[9]  Yen-Chen Liu,et al.  Knights Landing: Second-Generation Intel Xeon Phi Product , 2016, IEEE Micro.

[10]  Babak Falsafi,et al.  The Case for RackOut: Scalable Data Serving Using Rack-Scale Systems , 2016, SoCC.

[11]  Frederic T. Chong,et al.  Balancing Performance and Lifetime of MLC PCM by Using a Region Retention Monitor , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[12]  Jun Yang,et al.  FPB: Fine-grained Power Budgeting to Improve Write Throughput of Multi-level Cell Phase Change Memory , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[13]  Charles Zhang Mars: A 64-core ARMv8 processor , 2015, 2015 IEEE Hot Chips 27 Symposium (HCS).

[14]  Gu-Yeon Wei,et al.  Profiling a warehouse-scale computer , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[15]  Onur Mutlu,et al.  FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[16]  Michael M. Swift,et al.  An Analysis of Persistent Memory Use with WHISPER , 2017, ASPLOS.

[17]  Moinuddin K. Qureshi,et al.  Morphable memory system: a robust architecture for exploiting multi-level phase change memories , 2010, ISCA.

[18]  Norman P. Jouppi,et al.  Understanding the trade-offs in multi-level cell ReRAM memory design , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).

[19]  Onur Mutlu,et al.  Architecting phase change memory as a scalable dram alternative , 2009, ISCA '09.

[20]  Mohammad Arjomand,et al.  Boosting Access Parallelism to PCM-Based Main Memory , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[21]  Karsten Schwan,et al.  Data tiering in heterogeneous memory systems , 2016, EuroSys.

[22]  Stratis Viglas,et al.  Write-limited sorts and joins for persistent memory , 2014, Proc. VLDB Endow..

[23]  Christoforos E. Kozyrakis,et al.  Memory Hierarchy for Web Search , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[24]  Mark D. Hill,et al.  Efficiently enabling conventional block sizes for very large die-stacked DRAM caches , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[25]  Thomas F. Wenisch,et al.  SimFlex: Statistical Sampling of Computer System Simulation , 2006, IEEE Micro.

[26]  Weimin Zheng,et al.  DudeTM: Building Durable Transactions with Decoupling for Persistent Memory , 2017, ASPLOS.

[27]  Jinkyu Jeong,et al.  A fully associative, tagless DRAM cache , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[28]  Miguel Castro,et al.  No compromises: distributed transactions with consistency, availability, and performance , 2015, SOSP.

[29]  Rachata Ausavarungnirun,et al.  Row buffer locality aware caching policies for hybrid memories , 2012, 2012 IEEE 30th International Conference on Computer Design (ICCD).

[30]  Yuan Xie,et al.  Fabrication Cost Analysis and Cost-Aware Design Space Exploration for 3-D ICs , 2010, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[31]  Yan Solihin,et al.  CHOP: Adaptive filter-based DRAM caching for CMP server platforms , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[32]  Gabriel H. Loh,et al.  Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[33]  Babak Falsafi,et al.  Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[34]  Yuan Xie,et al.  Building and Optimizing MRAM-Based Commodity Memories , 2014, ACM Trans. Archit. Code Optim..

[35]  E. Eleftheriou,et al.  Demonstration of Reliable Triple-Level-Cell (TLC) Phase-Change Memory , 2016, 2016 IEEE 8th International Memory Workshop (IMW).

[36]  Cheng-Chieh Huang,et al.  ATCache: Reducing DRAM cache latency via a small SRAM tag cache , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[37]  Aamer Jaleel,et al.  CANDY: Enabling coherent DRAM caches for multi-node systems , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[38]  Prashant J. Shenoy,et al.  Blink: managing server clusters on intermittent power , 2011, ASPLOS XVI.

[39]  Emin Gün Sirer,et al.  Beehive: O(1) Lookup Performance for Power-Law Query Distributions in Peer-to-Peer Overlays , 2004, NSDI.

[40]  Aamer Jaleel,et al.  BEAR: Techniques for mitigating bandwidth bloat in gigascale DRAM caches , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[41]  Anirudh Badam,et al.  Viyojit: Decoupling battery and DRAM capacities for battery-backed DRAM , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[42]  Babak Falsafi,et al.  BuMP: Bulk Memory Access Prediction and Streaming , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[43]  Suman Nath,et al.  Rethinking Database Algorithms for Phase Change Memory , 2011, CIDR.

[44]  Bin Fan,et al.  Small cache, big effect: provable load balancing for randomly partitioned cluster services , 2011, SoCC.

[45]  Babak Falsafi,et al.  Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache , 2013, ISCA.

[46]  Ryousei Takano,et al.  RAMinate: Hypervisor-based Virtualization for Hybrid Main Memory Systems , 2016, SoCC.

[47]  Moinuddin K. Qureshi,et al.  Improving read performance of Phase Change Memories via Write Cancellation and Write Pausing , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[48]  Amirsaman Memaripour,et al.  Hybrid NVM Enabled Datacenter Design and Optimization , 2017 .

[49]  Thomas F. Wenisch,et al.  Thermostat: Application-transparent Page Management for Two-tiered Main Memory , 2017, ASPLOS.

[50]  Aamer Jaleel,et al.  BATMAN: techniques for maximizing system bandwidth of memory systems with stacked-DRAM , 2017, MEMSYS.

[51]  Timothy G. Armstrong,et al.  LinkBench: a database benchmark based on the Facebook social graph , 2013, SIGMOD '13.

[52]  Babak Falsafi,et al.  Clearing the clouds: a study of emerging scale-out workloads on modern hardware , 2012, ASPLOS XVII.

[53]  Babak Falsafi,et al.  Fat Caches for Scale-Out Servers , 2017, IEEE Micro.

[54]  Rachid Guerraoui,et al.  FloDB: Unlocking Memory in Persistent Key-Value Stores , 2017, EuroSys.

[55]  Kimberly Keeton,et al.  Memory-Driven Computing , 2017, FAST.

[56]  A. L. Narasimha Reddy,et al.  ARI: Adaptive LLC-memory traffic management , 2013, TACO.

[57]  Jacob Nelson,et al.  Approximate storage in solid-state memories , 2013, MICRO-46.

[58]  Pablo Rodriguez,et al.  I tube, you tube, everybody tubes: analyzing the world's largest user generated content video system , 2007, IMC '07.

[59]  Vijayalakshmi Srinivasan,et al.  Scalable high performance main memory system using phase-change memory technology , 2009, ISCA '09.

[60]  Karin Strauss,et al.  Preventing PCM banks from seizing too much power , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[61]  Steven Swanson,et al.  A study of application performance with non-volatile main memory , 2015, 2015 31st Symposium on Mass Storage Systems and Technologies (MSST).

[62]  Thomas F. Wenisch,et al.  SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling , 2003, ISCA '03.

[63]  Scott Shenker,et al.  Network Requirements for Resource Disaggregation , 2016, OSDI.