Software-based fault-tolerant routing algorithm in multidimensional networks

Massively parallel computing systems are built from hundreds or thousands of components such as nodes, links, memories, and connectors. The failure of a component in such a system not only reduces its computational power but also alters the network's topology. Software-based fault-tolerant routing is a popular scheme for achieving fault tolerance in networks. The algorithm was initially proposed only for two-dimensional networks (Suh et al., 2000). Since higher-dimensional networks are widely employed in many contemporary massively parallel systems, this paper proposes an approach to extend the routing scheme to them. The deadlock and livelock freedom of the presented algorithm is investigated for networks of different dimensionality and with various fault regions, and its performance is evaluated through simulation experiments.

[1] Patrick W. Dowd, et al. High speed routing in a parallel processing environment: a simulation study, 1991, ANSS '91.

[2] David F. Heidel, et al. An Overview of the BlueGene/L Supercomputer, 2002, ACM/IEEE SC 2002 Conference (SC'02).

[3] Young-Joo Suh, et al. Software-Based Rerouting for Fault-Tolerant Pipelined Communication, 2000, IEEE Trans. Parallel Distributed Syst.

[4] William J. Dally, et al. Principles and Practices of Interconnection Networks, 2004.

[5] William J. Dally, et al. Deadlock-Free Message Routing in Multiprocessor Interconnection Networks, 1987, IEEE Transactions on Computers.

[6] José Duato, et al. A New Theory of Deadlock-Free Adaptive Routing in Wormhole Networks, 1993, IEEE Trans. Parallel Distributed Syst.

[7] Antonio Robles, et al. A New Adaptive Fault-Tolerant Routing Methodology for Direct Networks, 2004, HiPC.

[8] Antonio Robles, et al. A transition-based fault-tolerant routing methodology for InfiniBand networks, 2004, 18th International Parallel and Distributed Processing Symposium.

[9] Antonio Robles, et al. An effective fault-tolerant routing methodology for direct networks, 2004, International Conference on Parallel Processing (ICPP 2004).

[10] Djibo Karimou, et al. A Fault-Tolerant Permutation Routing Algorithm in Mobile Ad-Hoc Networks, 2005, ICN.

[11] William J. Dally, et al. Virtual-channel flow control, 1990, Proceedings of the 17th Annual International Symposium on Computer Architecture.

[12] Suresh Chalasani, et al. Fault-Tolerant Wormhole Routing Algorithms for Mesh Networks, 1995, IEEE Trans. Computers.

[13] Mohamed F. Younis, et al. Fault-tolerant clustering of wireless sensor networks, 2003, IEEE Wireless Communications and Networking (WCNC 2003).

[14] Sudhakar Yalamanchili, et al. Interconnection Networks: An Engineering Approach, 2002.