Design of fault-tolerant distributed memory multiprocessors

This dissertation uses circuit-switched communication to develop reconfiguration schemes that keep multiprocessors based on the hypercube, the k-ary n-cube, and the k-ary tree operational in the presence of faulty processor nodes and/or faulty communication links. We require that reconfiguration leave the existing communication and computation algorithms unmodified. Our first hardware redundancy scheme, called the cluster approach, assigns one spare node to each subset of regular nodes, called a cluster. Local reconfiguration is used to replace faulty components with spares. Our simulation results indicate that the approach tolerates a moderate number of faults. In our second hardware redundancy scheme, called the enhanced cluster approach, the spare nodes of neighboring clusters are also interconnected. By using the circuit-switched capabilities of the spare nodes' communication modules, the scheme tolerates multiple faulty nodes per cluster. Our theoretical and simulation results indicate that the approach tolerates significantly more faults than other schemes proposed in the literature. To allow real-time fault tolerance, a two-stage redundant scheme is proposed. The scheme uses a global reconfiguration algorithm to assign a non-local spare node to a faulty node when the task completion deadline is soft, and uses local reconfiguration to assign a local spare node to a faulty node when the completion deadline is hard. For the case in which there is no hardware redundancy, a gracefully degradable approach is presented to reconfigure a faulty $d$-dimensional hypercube. The approach is optimal since it can always construct a $(d-1)$-subcube in the presence of up to $2^{d-1}$ faulty nodes. The approach is extended to tolerate combinations of faulty nodes and faulty links.
For the case in which the number of faulty nodes in a $d$-dimensional enhanced cluster hypercube exceeds the number of available spares, a gracefully degradable approach is presented to sustain a $(d-1)$-dimensional subcube. Finally, the management of the hypercube in the presence of faulty nodes is examined, and a procedure is presented to convert a faulty $d$-dimensional hypercube into an enhanced cluster hypercube of dimension $(d-1)$.
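
As a rough illustration of the cluster approach summarized above (one spare node per cluster, with local reconfiguration), the following Python sketch maps each faulty node to the spare of its own cluster. It is only a sketch under stated assumptions, not the dissertation's algorithm: the abstract does not say how clusters are formed, so the sketch assumes each cluster is the set of nodes sharing the same high-order address bits. The names cluster_of, reconfigure, and cluster_bits are hypothetical. Faulty spares, faulty links, and the circuit-switched rerouting itself are not modeled.

```python
from typing import Dict, Iterable, Optional


def cluster_of(node: int, cluster_bits: int) -> int:
    """Cluster index of a node: its address with the low cluster_bits dropped.

    ASSUMPTION: clusters are the groups of 2**cluster_bits nodes that share
    the same high-order address bits. The abstract does not specify this.
    """
    return node >> cluster_bits


def reconfigure(d: int, cluster_bits: int,
                faulty: Iterable[int]) -> Optional[Dict[int, int]]:
    """Assign each faulty node of a d-dimensional hypercube to its cluster's spare.

    Spares are identified by cluster index. Returns a dict mapping each
    faulty node to a spare, or None when some cluster contains more faulty
    nodes than its single spare can replace.
    """
    if not 0 <= cluster_bits <= d:
        raise ValueError("cluster_bits must lie between 0 and d")
    assignment: Dict[int, int] = {}
    used_spares = set()
    for node in sorted(set(faulty)):
        if not 0 <= node < 2 ** d:
            raise ValueError(f"node {node} is outside the {d}-cube")
        spare = cluster_of(node, cluster_bits)
        if spare in used_spares:
            return None  # second fault in the same cluster: local spare exhausted
        used_spares.add(spare)
        assignment[node] = spare
    return assignment


if __name__ == "__main__":
    # 4-cube split into four clusters of four nodes each.
    print(reconfigure(4, 2, [1, 6, 13]))
    print(reconfigure(4, 2, [4, 7]))
```

In this simplified model, a second fault within a cluster cannot be covered by local reconfiguration alone. That is the situation the enhanced cluster approach addresses by interconnecting the spares of neighboring clusters.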