Fault-Tolerance Considerations in Large, Multiple-Processor Systems

Researchers have long speculated about the possibility of constructing large, massively parallel computing engines by interconnecting many conventional processing elements to form an integrated supersystem. The rapid advance of very-large-scale integration (VLSI) circuit technology during the past decade has accelerated research in this direction. As advances in VLSI push the functionality of a basic component, or chip, to the processor level and beyond, it becomes natural to view complex processing elements as the basic components of much larger systems. Several names for such systems have been proposed, including network computers, multicomputers, and distributed multiprocessors. Despite the naming differences, these systems share the following salient features: (1) A large number of essentially autonomous processing elements interconnected by a structure that allows high-bandwidth communication between them. At the system level, these processing elements and interconnection facilities are viewed as the basic components of the system. Each processing node has its own local memory, and no memory is shared between nodes. (2) A high degree of distribution of control or operating system functions among the processing elements. (3) Highly parallel computation performed by constructing applications as collections of several or many distinct tasks. These tasks may execute concurrently on different processors, with necessary intertask communication carried out over the communication facilities linking the nodes. The collection of cooperating tasks comprising an application is sometimes referred to as a task force.
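
As an illustration only, the three features above can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not a description of any system in the literature: the names `Node`, `worker`, and `run_task_force` are hypothetical, threads stand in for separate processors, and a queue stands in for the interconnection links. Each node keeps its own local memory, and tasks cooperate only by sending messages.

```python
import queue
import threading


class Node:
    """Hypothetical processing element: private local memory plus a message inbox."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.local_memory = {}      # touched only by the task running on this node
        self.inbox = queue.Queue()  # stands in for the links of the interconnection structure


def worker(node, coordinator, chunk):
    # The task computes on its own local data only.
    node.local_memory["partial"] = sum(chunk)
    # Intertask communication: the result travels as a message, not through shared memory.
    coordinator.inbox.put((node.node_id, node.local_memory["partial"]))


def run_task_force(data, n_workers):
    """Run a small 'task force': n_workers summing tasks plus one collecting task."""
    coordinator = Node("coordinator")
    workers = [Node(i) for i in range(n_workers)]
    chunks = [data[i::n_workers] for i in range(n_workers)]  # one slice per node

    threads = [
        threading.Thread(target=worker, args=(w, coordinator, c))
        for w, c in zip(workers, chunks)
    ]
    for t in threads:
        t.start()

    # The collecting task receives exactly one message from each worker.
    total = 0
    for _ in range(n_workers):
        _sender, partial = coordinator.inbox.get()
        total += partial

    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(run_task_force(list(range(10)), 3))
```

The sketch deliberately omits any fault handling: if one worker failed before sending its message, the collecting task would block forever, which is the kind of situation fault-tolerance mechanisms in such systems must address.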
