Smart Redundancy for Distributed Computation

Many distributed software systems allow participation by large numbers of untrusted, potentially faulty components on an open network. Because faults are inevitable in this setting, these systems use redundancy and replication to achieve fault tolerance. In this paper, we present a novel "smart" redundancy technique called iterative redundancy, which ensures efficient replication of computation and data given finite processing and storage resources, even in the presence of Byzantine faults. Iterative redundancy is more efficient and more adaptive than comparable state-of-the-art techniques that operate in environments where the reliability of system resources is unknown. We show how systems that solve computational problems using a network of independent nodes can benefit from iterative redundancy. We present a formal analysis and an empirical analysis, demonstrate iterative redundancy on a real-world volunteer-computing system, and compare it to existing methods.
