Hydra: Resilient and Highly Available Remote Memory

We present Hydra, a low-latency, low-overhead, and highly available resilience mechanism for remote memory. Hydra can access erasure-coded remote memory with single-digit microsecond read/write latency, significantly improving the performance-efficiency trade-off over the state of the art: it performs similarly to in-memory replication with 1.6X lower memory overhead. We also propose CodingSets, a novel coding group placement algorithm for erasure-coded data that provides load balancing while reducing the probability of data loss under correlated failures by an order of magnitude. With Hydra, even when only 50% of memory is local, unmodified memory-intensive applications achieve performance close to that of the fully in-memory case in the presence of remote failures and outperform state-of-the-art solutions by up to 4.35X.
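To illustrate how a memory-overhead gap of this size can arise, the sketch below compares the storage cost of an erasure code with k data splits and r parity splits against plain replication. The abstract does not state the coding parameters or the replication factor; the values k = 8, r = 2 and two-way replication are assumptions chosen only because they reproduce a 1.6X ratio, and the function names are hypothetical.

```python
from fractions import Fraction


def erasure_overhead(k: int, r: int) -> Fraction:
    """Memory stored per unit of user data with k data and r parity splits."""
    if k <= 0 or r < 0:
        raise ValueError("k must be positive and r non-negative")
    return Fraction(k + r, k)


def replication_overhead(copies: int) -> Fraction:
    """Memory stored per unit of user data when keeping `copies` full copies."""
    if copies <= 0:
        raise ValueError("copies must be positive")
    return Fraction(copies, 1)


# Assumed parameters (not given in the abstract): k = 8, r = 2, two copies.
ec = erasure_overhead(k=8, r=2)
rep = replication_overhead(copies=2)
ratio = rep / ec

print(f"erasure coding overhead: {float(ec):.2f}x")
print(f"replication overhead:    {float(rep):.2f}x")
print(f"replication / erasure:   {float(ratio):.2f}x")
```

Under these assumed parameters the erasure-coded layout stores 1.25 units of memory per unit of data, against 2 units for replication, while still tolerating the loss of any two splits.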
