Fault tolerance in networks of bounded degree

Achieving processor cooperation in the presence of faults is a major problem in distributed systems. Popular paradigms such as Byzantine agreement have been studied principally in the context of a complete network. Indeed, Dolev (J. Algorithms, 3 (1982), pp. 14-30) and Hadzilacos (Issues of Fault Tolerance in Concurrent Computations, Ph.D. thesis, Harvard University, Cambridge, MA, 1984) have shown that Ω(t) connectivity is necessary if the requirement is that all nonfaulty processors decide unanimously, where t is the number of faults to be tolerated. We believe that in foreseeable technologies the number of faults will grow with the size of the network while the degree will remain practically fixed. We therefore raise the question whether it is possible to avoid the connectivity requirements by slightly lowering our expectations. In many practical situations we may be willing to "lose" some correct processors and settle for cooperation between the vast majority of the processors. Thus motivated, we present a general simulation technique by which vertices (processors) in almost any network of bounded degree can simulate an algorithm designed for the complete network. The simulation has the property that although some correct processors may be cut off from the majority of the network by faulty processors, the vast majority of the correct processors will be able to communicate among themselves undisturbed by the (arbitrary) behavior of the faulty nodes. We define a new paradigm for distributed computing, almost-everywhere agreement, in which we require only that almost all correct processors reach consensus. Unlike the traditional Byzantine agreement problem, almost-everywhere agreement can be solved on networks of bounded degree. Specifically, we can simulate any sufficiently resilient Byzantine agreement algorithm on a network of bounded degree using our communication scheme described above.
Although we "lose" some correct processors, effectively treating them as faulty, the vast majority of correct processors decide on a common value.

1. Preliminaries. In 1982 Dolev (D) published the following damning result for distributed computing: "Byzantine agreement is achievable only if the number of faulty processors in the system is less than one-half of the connectivity of the system's network." Even in the absence of malicious failures, connectivity t + 1 is required to achieve agreement in the presence of t faulty processors (H). The results are viewed as damning because of the fundamental nature of the Byzantine agreement problem. In this problem each processor begins with an initial value drawn from some domain V of possible values. At some point during the computation, during which processors repeatedly exchange messages and perform local computations, each processor must irreversibly decide on a value, subject to two conditions. No two correct processors may decide on different values, and if all correct processors begin with the same value v, then v must be the common decision value. (See (F) for a survey of related problems.) The ability to achieve this type of coordination is important in a wide range of applications, such as database management, fault-tolerant analysis of sensor readings, and coordinated control of multiple agents.

A simple corollary of the results of Dolev and Hadzilacos is that in order for a system to be able to reach Byzantine agreement in the presence of up to t faulty processors, every processor must be directly connected to at least Ω(t) others. Such high connectivity, while feasible in a small system, cannot be implemented at reasonable cost in a large system. As technology improves, increasingly large distributed systems and parallel com-