Some practical issues in the design of fault-tolerant multiprocessors

A node-covering approach to fault-tolerant design is generalized to apply to a wide class of multiprocessor systems whose structure and failure mechanisms are represented by arbitrary graphs. Several new types of covering graphs are defined, which lead to various design tradeoffs. A new technique for incremental design is presented, using a class of switch implementations that reduce a system's interconnection costs. The reduction of other cost factors is also addressed, including VLSI layout area minimization, efficient transfer of state information during recovery, and the efficient use of local spares. A fast, distributed algorithm for reconfiguration around faults is presented. A review of the general node-covering theory is included, focusing on how it models the important practical features of fault-tolerant systems.
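The abstract does not spell out the covering graphs or the reconfiguration algorithm, so the following is only a minimal sketch of the general node-covering idea, under stated assumptions: each non-spare node has one designated cover that can take over its role, and a fault is repaired by shifting roles along the chain of covers until a spare absorbs the last one. The names `covers`, `covering_sequence`, and `reconfigure` are hypothetical, the sketch handles a single fault, and it is centralized rather than distributed, so it should not be read as the paper's algorithm.

```python
from typing import Dict, List, Optional, Set


def covering_sequence(
    covers: Dict[str, str], spares: Set[str], faulty: str
) -> Optional[List[str]]:
    """Follow cover edges from the faulty node until a spare is reached.

    covers[v] is the node assumed able to take over v's role (hypothetical model).
    Returns the chain [faulty, ..., spare], or None if no spare is reachable.
    """
    path = [faulty]
    seen = {faulty}
    node = faulty
    while node not in spares:
        nxt = covers.get(node)
        if nxt is None or nxt in seen:
            return None  # dead end or cycle: this fault cannot be covered
        path.append(nxt)
        seen.add(nxt)
        node = nxt
    return path


def reconfigure(
    covers: Dict[str, str], spares: Set[str], faulty: str
) -> Optional[Dict[str, str]]:
    """Map each logical role to the physical node that holds it after one fault."""
    path = covering_sequence(covers, spares, faulty)
    if path is None:
        return None
    # Before the fault, every non-spare node holds its own role.
    assignment = {node: node for node in covers if node not in spares}
    # Each node on the chain hands its role to its cover.
    for role, replacement in zip(path, path[1:]):
        assignment[role] = replacement
    return assignment


# Linear array p0 - p1 - p2 with one spare s at the end (illustrative only).
covers = {"p0": "p1", "p1": "p2", "p2": "s"}
result = reconfigure(covers, {"s"}, "p1")
print(result)
```

In this toy array a fault at `p1` shifts `p1`'s role to `p2` and `p2`'s role to the spare `s`, while `p0` is untouched. The length of the chain is one simple proxy for how much state has to move during recovery, which is one of the cost factors the abstract mentions.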
