Mitigating the Performance-Efficiency Tradeoff in Resilient Memory Disaggregation

Memory disaggregation has received attention in recent years as a promising idea to reduce the total cost of ownership (TCO) of memory in modern datacenters. However, relying on remote memory expands an application's failure domain and makes it susceptible to tail latency variations. In attempting to make disaggregated memory resilient, state-of-the-art solutions face the classic tradeoff between performance and efficiency: some double the memory overhead of disaggregation by replicating to remote memory, while many others limit performance by replicating to the local disk. We present Hydra, a configurable, erasure-coded resilience mechanism for common memory disaggregation solutions. It can transparently handle uncertainties arising from remote failures, evictions, memory corruptions, and stragglers caused by network imbalance, with a significantly better performance-efficiency tradeoff than the state of the art. We design a fine-tuned data path that achieves single-µs read/write latency to remote memory, develop decentralized algorithms for cluster-wide memory management, and analyze how to select erasure-coding parameters to mitigate independent and correlated uncertainties. Our integration of Hydra with two major memory disaggregation systems and our evaluation on a 50-machine RDMA cluster demonstrate that it achieves the best of both worlds: it improves the latency and throughput of memory-intensive applications by up to 64.78X and 20.61X, respectively, over the state-of-the-art disk-backup-based solution, while providing performance similar to that of in-memory replication with 1.6X lower memory overhead.
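To make the erasure-coding tradeoff concrete, below is a minimal sketch of the core idea: a page is divided into K data splits plus parity before being pushed to remote memory, so the loss of any one split (a failed, evicted, or straggling remote machine) is survivable without keeping a full second copy. This toy uses a single XOR parity split; Hydra's actual data path uses more general (k, r) erasure codes, and the names and parameters here (PAGE_SIZE, K, encode, decode) are illustrative assumptions, not the system's API.

```python
# Toy single-parity erasure coding over a 4 KB page. Tolerates the loss
# of any one split out of K + 1. Illustrative only; Hydra itself uses
# general (k, r) erasure codes on its remote-memory data path.

PAGE_SIZE = 4096  # a typical 4 KB page
K = 4             # data splits per page

def encode(page: bytes) -> list:
    """Split a page into K data splits and append one XOR parity split."""
    assert len(page) == PAGE_SIZE and PAGE_SIZE % K == 0
    size = PAGE_SIZE // K
    splits = [page[i * size:(i + 1) * size] for i in range(K)]
    parity = splits[0]
    for s in splits[1:]:
        parity = bytes(a ^ b for a, b in zip(parity, s))
    return splits + [parity]

def decode(slots: list) -> bytes:
    """Rebuild the page from K + 1 slots, at most one of which is None
    (e.g., the failed or slowest remote machine)."""
    missing = [i for i, s in enumerate(slots) if s is None]
    assert len(missing) <= 1, "single-parity code tolerates one loss"
    if missing:
        present = [s for s in slots if s is not None]
        rebuilt = present[0]
        for s in present[1:]:
            rebuilt = bytes(a ^ b for a, b in zip(rebuilt, s))
        slots = slots[:missing[0]] + [rebuilt] + slots[missing[0] + 1:]
    return b"".join(slots[:K])  # drop the parity split

if __name__ == "__main__":
    import os
    page = os.urandom(PAGE_SIZE)
    slots = encode(page)
    slots[2] = None  # one remote split lost or late
    assert decode(slots) == page
```

Under these assumed parameters, the remote footprint is (K + 1) / K = 1.25x the page size, versus 2x for full in-memory replication, i.e., 1.6x lower, which is consistent with the memory-overhead figure quoted in the abstract.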
