Design and analysis of fault-tolerant pipelined multicomputer networks

Parallel architectures rely on fast inter-processor communication to exploit concurrency in computational tasks. Low message latency and high network throughput are necessary to exploit parallelism at increasingly fine granularities. However, as networks grow larger, the probability of component failure increases as well. A multicomputer network should continue to operate correctly in the presence of such failures, and its performance should degrade gracefully as the number of failures increases. The objective of this dissertation is to develop a framework for the design and analysis of fault-tolerant pipelined interconnection networks. Solutions to the problems of both static and dynamic link faults are presented. Two flexible variants of the common wormhole routing communication mechanism, pipelined circuit-switching (PCS) and acknowledged pipelined circuit-switching (APCS), are proposed for fault-tolerant communication support. In addition, an adaptive routing algorithm that uses misrouting and backtracking (MB-m) is presented and shown to provide good fault-tolerance properties. A dynamic fault recovery mechanism is also presented that preserves deadlock-free network operation even in the presence of dynamic link failures. An analytical model of network performance is developed and used to provide insight into various facets of router design. Finally, a hardware implementation of PCS and MB-m is presented to demonstrate the viability of these concepts.