An Efficient Fault Tolerance Framework for Distributed In-Memory Caching Systems

Many large database applications now rely on distributed in-memory object caching systems, of which Memcached is one of the most widely used. Memcached, however, provides no fault tolerance. To add it, Cocytus introduced Reed-Solomon (RS) codes and distributed protocols into Memcached, and it saves significant memory compared with primary-backup replication when tolerating the same number of failures. However, the relatively complex finite-field calculations required by RS codes and the high network transmission cost during data reconstruction have become new bottlenecks. This paper introduces RDP codes into distributed Memcached to reduce the coding computation cost in Cocytus. In addition, it adopts the RDOR scheme and Collective Reconstruction Read to speed up data reconstruction. Compared with Cocytus, which uses RS codes for fault tolerance, the new distributed Memcached with 4 data nodes and 2 parity nodes reduces reconstruction overhead by up to 31%.
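To illustrate why RDP coding avoids finite-field arithmetic, the following is a minimal Python sketch of RDP encoding for a layout with 4 data nodes and 2 parity nodes, using the prime p = 5. It is not the paper's implementation: the function names (`xor_bytes`, `rdp_encode`, `recover_column_by_rows`) and the block layout are assumptions made for illustration, and the sketch covers neither RDOR nor Collective Reconstruction Read. Every parity block is computed with XOR only.

```python
from functools import reduce
from typing import List

P = 5  # prime; yields P-1 = 4 data columns plus row-parity and diagonal-parity columns


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def rdp_encode(data: List[List[bytes]], p: int = P) -> List[List[bytes]]:
    """Encode one stripe.

    data[r][c] is the block in row r of data column c (p-1 rows, p-1 columns).
    Returns p-1 rows of p+1 blocks: data, row parity, diagonal parity.
    """
    rows = p - 1
    zero = bytes(len(data[0][0]))
    stripe = []
    for r in range(rows):
        # Row parity: XOR of all data blocks in the row.
        row_parity = reduce(xor_bytes, data[r], zero)
        stripe.append(list(data[r]) + [row_parity])

    # Block (r, c), for c in 0..p-1, lies on diagonal (r + c) mod p.
    # Diagonals 0..p-2 are stored; diagonal p-1 is the "missing" one.
    diag = [zero] * rows
    for r in range(rows):
        for c in range(p):
            d = (r + c) % p
            if d != p - 1:
                diag[d] = xor_bytes(diag[d], stripe[r][c])

    for r in range(rows):
        stripe[r].append(diag[r])
    return stripe


def recover_column_by_rows(stripe: List[List[bytes]], lost: int, p: int = P) -> List[bytes]:
    """Rebuild one lost data or row-parity column from the surviving columns of each row."""
    if not 0 <= lost < p:
        raise ValueError("lost must index a data or row-parity column")
    zero = bytes(len(stripe[0][0]))
    return [
        reduce(xor_bytes, (row[c] for c in range(p) if c != lost), zero)
        for row in stripe
    ]


if __name__ == "__main__":
    blocks = [[bytes([16 * r + c] * 4) for c in range(4)] for r in range(4)]
    encoded = rdp_encode(blocks)
    rebuilt = recover_column_by_rows(encoded, lost=2)
    print(rebuilt == [row[2] for row in encoded])
```

This sketch recovers a single failed column from row parity alone. Schemes such as RDOR instead mix row and diagonal parity sets during single-failure recovery so that fewer blocks have to be read, which is where the reduction in reconstruction traffic comes from.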
