Characterization of failures in an IP backbone

We analyze IS-IS routing updates from Sprint's IP network to characterize failures that affect IP connectivity. Failures are first classified by probable cause, such as maintenance activities, router-related problems, and optical-layer problems. Key temporal and spatial characteristics of each class are analyzed and, when appropriate, parameterized using well-known distributions. Our results indicate that 20% of all failures are due to planned maintenance activities. Of the unplanned failures, almost 30% are shared by multiple links and can be attributed to router-related and optical equipment-related problems, while 70% affect a single link at a time. Our classification of failures according to different causes reveals the nature and extent of failures in today's IP backbones. Furthermore, our characterization of the different classes can be used to develop a probabilistic failure model, which is important for various traffic engineering problems.
