Running algorithms efficiently on faulty hypercubes

We examine the issue of running algorithms with a constant factor slowdown on a faulty hypercube in a worst case scenario. We present two sets of novel results related to this issue. We first consider edge faults and show how to tolerate faults with a constant factor slowdown in communication and no slowdown in computation. The key to our approach is an efficient method for embedding a fault-free Cube Connected Cycles (CCC) graph in the faulty hypercube. Using this embedding we can run ascend-descend algorithms (such as bitonic sort) on the faulty hypercube by implementing them on the embedded CCC. We then consider hypercubes with both edge and node faults. We prove that for any constant c there exists a fault-free subgraph of an n-dimensional hypercube with n^c faulty components that can implement a large class of hypercube algorithms with only a constant factor slowdown. To the best of our knowledge, this result is the first in which a hypercube can tolerate more than O(n) faults in the worst case sense.

1 Introduction

The n-dimensional hypercube (n-cube) is one of the most popular interconnection topologies for parallel computers. Hypercube-based parallel machines are built and sold commercially, and it is expected that they will continue to play an important role in the future. One of the most important issues related to such parallel machines is how they can compute in the presence of faulty components.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1990 ACM 089791-370-1/90/0007/0037 $1.50

In the following discussion, we view a parallel computer as a graph in which the nodes correspond to processors and the edges correspond to communication links. The issue of computing with faulty hypercubes was addressed in several recent papers [1]-[7], [10], [11], [14]. Most notable is the result by Hastad, Leighton and Newman [7]. They considered a faulty hypercube in which every node is faulty with constant probability p < 1 and the faults are independently distributed. They proved that, with high probability, the faulty hypercube can simulate a fault-free hypercube with only a constant factor slowdown. Thus the hypercube is extremely tolerant of randomly distributed faults.

In this paper, we examine the fault tolerance of the hypercube with respect to a worst case distribution of faults. That is, we demonstrate the capabilities that a faulty hypercube is guaranteed to possess, regardless of the distribution of faults. A few researchers have studied the issue of fault tolerance in the hypercube in a worst case scenario [3,4,5,10,11]. For example, the problem of finding a large fault-free subcube in a hypercube with faulty nodes has been studied [3,11]. The idea here is to use this fault-free subcube to simulate the whole hypercube. However, it has been shown that in order to guarantee a constant factor slowdown, the n-cube must have only O(log n) faults [3,11].

We present two results for hypercubes with worst case faults. The first result (Section 2) concerns hypercubes in which only the edges are faulty. In this case we would like to run hypercube algorithms on the faulty hypercube with a constant factor slowdown in communication and no slowdown in computation. In order to achieve this goal, we use the tool of graph embeddings [2,4,12]. Our approach is to take a topology that has many of the capabilities of the hypercube and embed it in the hypercube without using the faulty edges.
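The fault-free subcube approach mentioned above can be illustrated with a small brute-force search. This is only a sketch for tiny n, not an algorithm from the cited papers; the function names `subcube_nodes` and `largest_fault_free_subcube` and the encoding of nodes as n-bit integers are assumptions made for illustration.

```python
from itertools import combinations, product


def subcube_nodes(n, fixed):
    """Nodes of the subcube of the n-cube obtained by pinning each
    dimension in `fixed` (a dict: dimension -> bit) to the given bit.
    Nodes are encoded as n-bit integers (an illustrative assumption)."""
    free = [d for d in range(n) if d not in fixed]
    base = sum(bit << d for d, bit in fixed.items())
    nodes = []
    for bits in product((0, 1), repeat=len(free)):
        node = base
        for d, b in zip(free, bits):
            node |= b << d
        nodes.append(node)
    return nodes


def largest_fault_free_subcube(n, faulty_nodes):
    """Exhaustively search for a largest subcube with no faulty node.
    Returns (dimension, fixed) or None if every node is faulty.
    Exponential in n, so suitable for small examples only."""
    faulty = set(faulty_nodes)
    # Pinning fewer dimensions gives a larger subcube, so try those first.
    for num_fixed in range(n + 1):
        for dims in combinations(range(n), num_fixed):
            for bits in product((0, 1), repeat=num_fixed):
                fixed = dict(zip(dims, bits))
                if not faulty.intersection(subcube_nodes(n, fixed)):
                    return n - num_fixed, fixed
    return None
```

For instance, in a 4-cube with the two antipodal faulty nodes 0000 and 1111, every 3-dimensional subcube contains one of the faults, so the search returns a subcube of dimension 2. This shows how quickly a handful of badly placed faults shrinks the usable subcube.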
In particular, we consider the Cube Connected Cycles (CCC) graph, which is known to be able to implement ascend-descend algorithms (such as bitonic sort) [13] with only a constant factor slowdown relative to the hypercube. We have devised a sequential algorithm which, given a list of faulty edges, computes a fault-free embedding of the CCC in the faulty hypercube. The algorithm is optimal in both running time and in the number of faults that it can tolerate. Specifically, in O(n) time it can find a fault-free CCC with 2^n nodes that is a subgraph of an n-cube with n - 3 faulty edges. We also consider several extensions to this result, including an optimal algorithm for embedding meshes in hypercubes with faulty edges.

The second half of this paper (Section 3) examines hypercubes with faulty edges and nodes. Here we consider the class of "weak hypercube algorithms" in which each node sends and receives at most one message in unit time. A natural question is whether or not we can tolerate more than O(n) faults in an n-cube. We have obtained a positive answer to this question by showing how, for any constant c, an n-cube that has O(n^c) faulty components (nodes or edges) can implement any weak hypercube algorithm with only a constant factor slowdown. This is the first result that we know of in which an n-cube can tolerate more than O(n) worst case faults. The key to our approach is the fact that for any n^c faults (c a constant) we can partition the n-cube into subcubes of constant dimension, such that every subcube has a connected component of healthy nodes of more than half its size. This is a generalization of the result given by Chan and Lee [5] for the case c = 1. For example, if there are i~n 2 faulty components, we can partition the n-cube into 4-cubes such that each has a fault-free connected component with at least 9 nodes.

2 Edge Faults and Graph Embeddings

In this section we present a novel use of graph embeddings for fault tolerance.
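To make the notion of a CCC embedded as a subgraph of the hypercube concrete, the sketch below builds the standard fixed embedding for cycle length k = 2^r (so n = k + r) and tests whether it happens to use a given faulty edge. It is not the algorithm of this paper, which chooses an embedding that avoids the faulty edges; the Gray-code encoding of cycle positions and the names `gray`, `ccc_image_edges` and `embedding_avoids_faults` are assumptions made for illustration.

```python
def gray(i):
    """Binary reflected Gray code of i."""
    return i ^ (i >> 1)


def is_hypercube_edge(u, v):
    """True if node labels u and v differ in exactly one bit."""
    d = u ^ v
    return d != 0 and d & (d - 1) == 0


def ccc_image_edges(r):
    """Hypercube edges used by a fixed embedding of the CCC with cycles
    of length k = 2**r (r >= 2) into the n-cube, n = k + r.
    CCC node (i, x), with cycle position i and k-bit cube label x, is
    mapped to the n-bit label (gray(i) << k) | x."""
    k = 1 << r
    edges = set()
    for x in range(1 << k):
        for i in range(k):
            u = (gray(i) << k) | x
            # Cycle edge: next position on the same cycle. Consecutive
            # Gray codes (cyclically, for a power-of-two length) differ
            # in one bit, so this is a hypercube edge.
            edges.add(frozenset((u, (gray((i + 1) % k) << k) | x)))
            # Cube edge: position i flips bit i of the cube label.
            edges.add(frozenset((u, (gray(i) << k) | (x ^ (1 << i)))))
    return edges


def embedding_avoids_faults(r, faulty_edges):
    """True if this particular embedding uses none of the faulty edges."""
    faulty = {frozenset(e) for e in faulty_edges}
    return ccc_image_edges(r).isdisjoint(faulty)
```

With r = 2 the CCC has 64 nodes of degree 3 and therefore 96 edges, all of which are edges of the 6-cube. The fixed embedding uses the edge between nodes 0 and 1 but not the edge between nodes 0 and 2, so a fault on the former would force a different embedding while a fault on the latter would not.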
The emphasis here is on worst case edge faults and on the utilization of all the nodes. Our approach is to embed a fault-free Cube Connected Cycles (CCC) graph in the faulty hypercube, which spans all the nodes. Ascend-descend algorithms (such as bitonic sort) [13] can then be automatically translated to run on the CCC with only a moderate constant factor slowdown relative to their running time on the fault-free hypercube. Note that the slowdown affects only communication operations. Our main result here is an optimal algorithm that computes a fault-free embedding, which can tolerate the maximum number of faults. More specifically, we give a sequential algorithm for the following problem, whose running time is linear in the dimension of the hypercube.

INPUT n; set F of faulty edges in the n-cube such