Fault Tolerance in Distributed Systems Using Fused Data Structures

Replication is the prevalent solution for tolerating faults in large data structures hosted on distributed servers. To tolerate f crash faults (dead or unresponsive data structures) among n distinct data structures, replication requires f + 1 replicas of each data structure, resulting in nf additional backups. We present a solution, referred to as fusion, that uses a combination of erasure codes and selective replication to tolerate f crash faults using just f additional fused backups. We show that our solution achieves O(n) savings in space over replication. Further, we present a solution to tolerate f Byzantine faults (malicious data structures) that requires only nf + f backups, compared to the 2nf backups required by replication. We explore the theory of fused backups and provide a library of such backups for all the data structures in the Java Collections Framework. Our theoretical and experimental evaluation confirms that fused backups are space-efficient compared to replication, while adding very little overhead during normal operation. To illustrate the practical usefulness of fusion, we use fused backups for reliability in Amazon's highly available key-value store, Dynamo. While the current replication-based solution uses 300 backup structures, we present a solution that requires only 120 backup structures. This saves space as well as other resources such as power.
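The backup counts quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python that only restates the formulas from the abstract (nf versus f for crash faults, 2nf versus nf + f for Byzantine faults). The function names and the sample values n = 10, f = 3 are illustrative assumptions, not taken from the paper, and the sketch counts backup structures only; it does not model the size of each backup or the fusion algorithm itself.

```python
def crash_backups_replication(n: int, f: int) -> int:
    """Additional backups under replication: f extra copies of each of n structures."""
    return n * f


def crash_backups_fusion(n: int, f: int) -> int:
    """Additional fused backups needed to tolerate f crash faults (independent of n)."""
    return f


def byzantine_backups_replication(n: int, f: int) -> int:
    """Backups under replication for f Byzantine faults, as stated in the abstract."""
    return 2 * n * f


def byzantine_backups_fusion(n: int, f: int) -> int:
    """Backups under the fusion-based scheme for f Byzantine faults."""
    return n * f + f


if __name__ == "__main__":
    n, f = 10, 3  # hypothetical example values, not from the paper
    print("crash, replication:    ", crash_backups_replication(n, f))
    print("crash, fusion:         ", crash_backups_fusion(n, f))
    print("Byzantine, replication:", byzantine_backups_replication(n, f))
    print("Byzantine, fusion:     ", byzantine_backups_fusion(n, f))
```

For crash faults, the ratio of the two counts is nf / f = n, which is one way to read the O(n) savings claim in terms of the number of backup structures.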
