Overview of fault handling for the chaos router

The chaos router is an adaptive nonminimal message router for multicomputers that is simple enough to compete with the fast, oblivious routers now used in commercial machines. It improves on previous adaptive routers by using randomization, which eliminates the need for complex livelock protection and speeds up the router. This randomization, however, greatly complicates fault detection: there is no worst-case bound on the time required to deliver a message, so distinguishing between lost and very slow messages is difficult. A new method of fault detection is presented that applies not only to the chaos router but also to other adaptive routers. In addition, solutions to several practical fault diagnosis and recovery problems in the chaos router are presented. The presentation supports the claim that fault tolerance can be incorporated into a practical router without harming performance in the normal, fault-free case.
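Why does an unbounded delivery time frustrate timeout-based detection? Consider a deliberately simplified, hypothetical model. It is not the chaos router's actual routing behavior and not the fault-detection method the paper proposes. At each step, a live message reaches its destination with probability p. Otherwise it is misrouted and tries again. Delivery time is then geometric, and the probability that a live message is still in flight after t steps is (1 - p)^t. This is positive for every finite t, so a detector that declares a message lost after a fixed timeout always carries some false-alarm rate. The names `survival_probability`, `simulate_delivery_steps`, and `false_alarm_rate` are illustrative only.

```python
import random


def survival_probability(p_progress: float, timeout_steps: int) -> float:
    """P(live message still undelivered after timeout_steps) in a toy model.

    Hypothetical model (an assumption, not the chaos router itself): each step,
    a live message is delivered with probability p_progress; otherwise it is
    misrouted and tries again. Delivery time is therefore geometric.
    """
    if not 0.0 < p_progress <= 1.0:
        raise ValueError("p_progress must be in (0, 1]")
    return (1.0 - p_progress) ** timeout_steps


def simulate_delivery_steps(p_progress: float, rng: random.Random) -> int:
    """Draw one delivery time (in steps) for a live message in the toy model."""
    steps = 1
    while rng.random() >= p_progress:  # misrouted this step; keep going
        steps += 1
    return steps


def false_alarm_rate(p_progress: float, timeout_steps: int,
                     trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of how often a fixed timeout flags a live message as lost."""
    rng = random.Random(seed)
    late = sum(
        1 for _ in range(trials)
        if simulate_delivery_steps(p_progress, rng) > timeout_steps
    )
    return late / trials


if __name__ == "__main__":
    for timeout in (10, 50, 100):
        exact = survival_probability(0.1, timeout)
        estimate = false_alarm_rate(0.1, timeout, trials=20_000, seed=1)
        print(f"timeout={timeout}: exact={exact:.6f} simulated={estimate:.6f}")
```

In this toy model, raising the timeout shrinks the false-alarm rate geometrically but never drives it to zero. Waiting longer alone therefore cannot cleanly separate lost messages from very slow ones.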
