Fault impact and fault tolerance in multiprocessor interconnection networks

The growing complexity of parallel machines, coupled with increasing chip densities, escalates the need for fault tolerance and recovery in these systems. Many techniques have been proposed in pursuit of fault-tolerant multiprocessors. Since methods for designing fault-tolerant processors and memories are relatively mature, the techniques considered in this paper focus on the interconnection network (ICN) linking the processors. The impact of faults on non-fault-tolerant ICNs is contrasted with that on fault-tolerant networks. Fault tolerance in ICNs is addressed at two levels: the inter-node or switch level and the system level. The inter-node or switch level pertains to data and control integrity, while the system level deals with maintaining network connectivity and adequate performance in the presence of faults. Fault-tolerant schemes at the switching-element level require some form of concurrent error detection, such as error-detecting codes, usually combined with a full handshake protocol. The space–time trade-offs involved in the use of various codes and protocols are investigated. At the system level, several augmented multistage switching ICNs, tree networks, and ring networks are studied. The combined provision of fault tolerance and improved performance under fault-free conditions is emphasized. Finally, strategies for network reconfiguration and rerouting after system failure are presented.
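
As an illustration of switch-level concurrent error detection combined with a handshake, the following minimal sketch pairs a single even-parity bit with an acknowledge-or-retry loop. It is not taken from the paper: the function names, the fault model (a link that corrupts a fixed number of transfers), and the retry limit are all assumptions made for the example, and a single parity bit detects only an odd number of flipped bits.

```python
def parity_bit(word: int) -> int:
    """Even-parity check bit: 1 when the word holds an odd number of 1 bits."""
    return bin(word).count("1") % 2


def check(word: int, parity: int) -> bool:
    """Receiver-side check: True when the word matches its parity bit."""
    return parity_bit(word) == parity


def make_faulty_link(flip_mask: int, faulty_transfers: int):
    """Hypothetical link model: XORs flip_mask into the data word for the
    first `faulty_transfers` transfers, then behaves correctly."""
    remaining = faulty_transfers

    def link(word: int, parity: int):
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            return word ^ flip_mask, parity
        return word, parity

    return link


def send_with_handshake(word: int, link, max_attempts: int = 3):
    """Send `word` with its parity bit and retry while the receiver rejects it.

    Returns (received_word, attempts_used). Raises RuntimeError when every
    attempt fails, which a system-level scheme could treat as a cue to
    reroute around the link (assumed policy, not from the paper).
    """
    for attempt in range(1, max_attempts + 1):
        rx_word, rx_parity = link(word, parity_bit(word))
        if check(rx_word, rx_parity):  # receiver acknowledges
            return rx_word, attempt
        # receiver signals an error; the sender retransmits
    raise RuntimeError("link rejected every attempt")


if __name__ == "__main__":
    transient = make_faulty_link(flip_mask=0b0100, faulty_transfers=1)
    print(send_with_handshake(0b1011, transient))
```

The sketch also hints at the space–time trade-off: the parity bit costs one extra wire per word, while each rejected transfer costs one additional round trip. A stronger code would detect more error patterns at the price of more check bits.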
