A Case for Asymmetric Non-Volatile Memory Architecture

Byte-addressable non-volatile memory (NVM) is a promising technology because it simultaneously provides DRAM-like performance, disk-like capacity, and persistency. Current NVM deployment is symmetric: NVM devices are attached directly to servers. Because of its higher density, NVM provides larger capacity and can be shared among servers. Unfortunately, in the symmetric setting, the availability of an NVM device depends on the specific machine it is attached to. High availability can be achieved by replicating data to NVM on a remote machine, but this requires full replication of the data structure in local memory, which limits the size of the working set. This paper rethinks NVM deployment and makes a case for an asymmetric NVM architecture, which decouples servers from persistent data storage. In the proposed AsymNVM architecture, NVM devices (back-end nodes) can be shared by multiple servers (front-end nodes) and provide recoverable persistent data structures. The asymmetric architecture is made possible by RDMA and follows the recent industry trend of resource disaggregation. We build the AsymNVM framework on top of the AsymNVM architecture; it implements: 1) high-performance persistent data structure updates; 2) NVM data management; 3) concurrency control; and 4) crash consistency and replication. The central idea is to use operation logs to reduce the stall caused by RDMA writes and to enable efficient batching and caching in front-end nodes. To evaluate performance, we construct eight widely used data structures and two applications on the AsymNVM framework and use traces of industry workloads. On a cluster of ten machines, the results show that AsymNVM achieves performance comparable to the best possible symmetric architecture while, through disaggregation, avoiding all of its drawbacks. Compared to the baseline AsymNVM, the proposed optimizations bring a speedup of 6x to 22x.
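The operation-log idea can be illustrated with a minimal sketch. It is not the AsymNVM implementation: the class names (`BackEnd`, `FrontEnd`), the record format, and the `batch_size` parameter are all hypothetical, and the RDMA write is simulated by an ordinary method call. It only shows how a front-end node might buffer small log records, ship them in batches to reduce the number of remote writes, serve reads from a local cache, and let the back-end rebuild state by replaying the log.

```python
from dataclasses import dataclass, field


@dataclass
class BackEnd:
    """Hypothetical stand-in for an NVM back-end node (no real RDMA here)."""
    log: list = field(default_factory=list)  # persisted operation log
    writes: int = 0                          # count of simulated remote writes

    def remote_write(self, records):
        # One simulated remote write persists a whole batch of log records.
        self.log.extend(records)
        self.writes += 1

    def replay(self):
        # Recovery: rebuild the key-value state by replaying the log in order.
        state = {}
        for op, key, value in self.log:
            if op == "put":
                state[key] = value
            elif op == "delete":
                state.pop(key, None)
        return state


class FrontEnd:
    """Hypothetical front-end node with a local cache and a log buffer."""

    def __init__(self, backend, batch_size=4):
        self.backend = backend
        self.batch_size = batch_size
        self.pending = []  # operation-log records not yet sent
        self.cache = {}    # local cached copy of the data structure

    def _record(self, op, key, value=None):
        self.pending.append((op, key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def put(self, key, value):
        self.cache[key] = value
        self._record("put", key, value)

    def delete(self, key):
        self.cache.pop(key, None)
        self._record("delete", key)

    def get(self, key):
        # Reads are served from the local cache in this sketch.
        return self.cache.get(key)

    def flush(self):
        # Ship all buffered records in a single simulated remote write.
        if self.pending:
            self.backend.remote_write(self.pending)
            self.pending = []
```

In this sketch, operations that have not yet been flushed would be lost if the front-end crashed; how the actual framework bounds that window is not stated in the abstract, so the batching policy here is an assumption.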
