Design and Evaluation of Hardware Strategies for Reconfiguring Hypercubes and Meshes Under Faults

This paper discusses the design of two reconfiguration strategies that allow distributed-memory multicomputer architectures to tolerate failures. The specific architectures to which we apply the techniques are hypercubes and meshes. The first scheme attaches spare processors to certain processors in the hypercube or mesh using a novel embedding technique. The second scheme places spare processors along specific links in the hypercube or mesh. Both schemes map the logical links of a virtual machine onto sets of physical links in the final reconfigured machine and hence suffer some performance degradation. We characterize this degradation through trace-driven simulation of real applications running on the faulty and reconfigured system. We find that the schemes have high reliability, suffer little degradation in performance, and are very low in cost.
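To make the source of the degradation concrete, the following Python sketch shows how a logical link can stretch over several physical links after reconfiguration. It is not the embedding proposed in the paper, whose details the abstract does not give. It assumes a single spare (labelled `S`, a hypothetical name) attached to one host node, a failed node adjacent to that host, and shortest-path routing around the failed node; all function names are illustrative.

```python
from collections import deque


def hypercube_neighbors(node, dim):
    """Neighbors of `node` in a dim-dimensional binary hypercube."""
    return [node ^ (1 << k) for k in range(dim)]


def build_physical_graph(dim, failed, host, spare="S"):
    """Physical machine after a fault: the failed node is removed and a
    spare hangs off `host` (assumed attachment, not the paper's scheme)."""
    if failed == host:
        raise ValueError("this sketch assumes the host itself is fault-free")
    graph = {}
    for node in range(1 << dim):
        if node == failed:
            continue
        graph[node] = [n for n in hypercube_neighbors(node, dim) if n != failed]
    graph[host].append(spare)
    graph[spare] = [host]
    return graph


def hops(graph, src, dst):
    """Breadth-first search: number of physical links on a shortest path."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # unreachable


def link_dilations(dim, failed, host, spare="S"):
    """For each logical link (failed, n), the number of physical links
    needed once the spare takes over the failed node's logical role."""
    graph = build_physical_graph(dim, failed, host, spare)
    return {n: hops(graph, n, spare) for n in hypercube_neighbors(failed, dim)}


if __name__ == "__main__":
    # 3-cube, node 1 fails, spare attached to node 0.
    print(link_dilations(3, failed=1, host=0))
```

In this toy configuration the logical link to the host still takes one physical hop, while the other two logical links of the failed node each take three, which is the kind of per-link stretch that a trace-driven simulation would translate into application-level slowdown.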
