Automatic fault detection and recovery in real time switched Ethernet networks

EtheReal is a real-time Fast Ethernet switch architecture that provides bandwidth guarantees to distributed multimedia applications without OS or hardware modifications on the host machines. It implements true link-layer multicast and is a natural match for network-layer QoS protocols such as RSVP. Because real-time performance guarantees fundamentally require state to be installed inside the network, link or switch failures can significantly disrupt the QoS promised to user applications. This paper describes the fault detection and recovery mechanism supported by the EtheReal architecture and reports performance measurements of the initial prototype implementation. The heart of this mechanism consists of two parts: a fast spanning tree reconfiguration algorithm that reduces the total fault recovery time, and a delayed link inactivation scheme that allows real-time connections not affected by the failed links or switches to continue to exist, even though some of the links they use are marked as "blocked" in the new spanning tree topology. Measurements on the prototype show that, on a network with a diameter of 10 hops, the fault detection time is 220 ms and the fault recovery time is 31 ms. The combined delay of about 251 ms corresponds to a minor jitter in real-time audio/video communication, and is a significant improvement over the standard IEEE 802.1d implementation, which takes on the order of 30 seconds.
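
The delayed link inactivation idea described above can be illustrated with a small Python sketch. It is not the EtheReal implementation: the `Link` class, the `reconfigure` and `release` functions, and the four-switch example topology are all hypothetical names and data invented for illustration. The sketch assumes only what the abstract states, namely that connections crossing a failed link are disrupted, while connections on healthy links keep running even if those links are blocked in the new spanning tree.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    """Hypothetical per-link state kept by a switch (illustrative only)."""
    name: str
    failed: bool = False
    blocked: bool = False  # True if the link is not part of the current spanning tree
    connections: set = field(default_factory=set)  # real-time connections using the link

    def accepts_new(self) -> bool:
        # New connections may only be set up on healthy links in the spanning tree.
        return not self.failed and not self.blocked


def reconfigure(links, new_tree):
    """Apply a new spanning tree and return the connections disrupted by failures.

    Healthy links that become blocked keep their existing connections
    (this is the delayed inactivation); only connections that cross a
    failed link are removed.
    """
    disrupted = set()
    for link in links.values():
        if link.failed:
            disrupted |= link.connections
        link.blocked = link.name not in new_tree
    for link in links.values():
        link.connections -= disrupted
    return disrupted


def release(links, conn):
    """End a connection; return blocked links that are now idle and can be inactivated."""
    idle = []
    for link in links.values():
        if conn in link.connections:
            link.connections.discard(conn)
            if link.blocked and not link.failed and not link.connections:
                idle.append(link.name)
    return sorted(idle)


# Assumed example: four switches s1..s4 with five links; s1-s2 has failed.
links = {
    "s1-s2": Link("s1-s2", failed=True, connections={"c1"}),
    "s2-s3": Link("s2-s3", connections={"c2"}),
    "s3-s4": Link("s3-s4", connections={"c2"}),
    "s4-s1": Link("s4-s1"),
    "s1-s3": Link("s1-s3"),
}
new_tree = {"s1-s3", "s2-s3", "s4-s1"}  # assumed result of the reconfiguration

disrupted = reconfigure(links, new_tree)   # only c1 crosses the failed link
still_carried = "c2" in links["s3-s4"].connections  # c2 survives on a blocked link
inactivated = release(links, "c2")         # s3-s4 becomes idle once c2 ends
```

In this example, connection `c2` continues over `s3-s4` although that link is blocked in the new tree, and the link is reported as ready for inactivation only after `c2` is released.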
