Characterization of spatial fault patterns in interconnection networks

Parallel computers, such as multiprocessor systems-on-chip (MP-SoCs), multicomputers and cluster computers, consist of hundreds or thousands of processing units and other components (such as routers, channels and connectors) connected via an interconnection network, and these components may collectively experience high failure rates. Such systems therefore need fault-tolerant mechanisms to ensure that they keep running in a degraded mode. Normally, faulty components are coalesced into fault regions, which are classified into two major categories: convex and concave. In this paper, we propose the first solution for calculating the probability of occurrence of common fault patterns in torus and mesh interconnection networks, covering both convex (|-shaped, □-shaped) and concave (L-shaped, T-shaped, +-shaped, H-shaped) regions. These results play a key role in studies of interconnection networks under faulty conditions, particularly in the performance analysis of routing algorithms proposed for such networks.
