IP fault localization via risk modeling

Automated, rapid, and effective fault management is a central goal of large operational IP networks. Today's networks suffer from a wide and volatile set of failure modes in which the underlying fault is often difficult to detect and localize, which delays repair. One of the main challenges stems from operational reality: IP routing and the underlying optical fiber plant are typically described by disparate data models and housed in distinct network management systems. We introduce a fault-localization methodology based on risk models, together with an associated troubleshooting system, SCORE (Spatial Correlation Engine), which automatically identifies likely root causes across layers. In particular, we apply SCORE to the problem of localizing link failures in IP and optical networks. In experiments conducted on a tier-1 ISP backbone, SCORE proved remarkably effective at localizing optical link failures using only IP-layer event logs. Moreover, SCORE was often able to automatically uncover inconsistencies in the databases that maintain the critical associations between the IP and optical networks.
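
To make the idea of a risk model concrete, the following is a minimal sketch and not SCORE's actual implementation, which the abstract does not specify. It assumes that each shared risk (a fiber span, a router, and so on) maps to the set of IP links that would fail with it, and that a greedy set-cover heuristic picks the fewest risks that explain the observed IP-layer link failures. The function name, the group names, and the rule that a risk is a candidate only when all of its links have failed are illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple


def localize(
    risk_groups: Dict[str, Set[str]], failed_links: Set[str]
) -> Tuple[List[str], Set[str]]:
    """Greedy set cover over shared risk groups (illustrative sketch).

    risk_groups maps a hypothetical risk name to the IP links that share it.
    Returns the chosen risks and any failed links left unexplained.
    """
    unexplained = set(failed_links)
    hypothesis: List[str] = []
    while unexplained:
        best_name = None
        best_cover = 0
        # Iterate in sorted order so that ties are broken deterministically.
        for name in sorted(risk_groups):
            links = risk_groups[name]
            # Assumption: a risk is a candidate only if every link that
            # depends on it is among the observed failures.
            if not links <= failed_links:
                continue
            cover = len(links & unexplained)
            if cover > best_cover:
                best_name, best_cover = name, cover
        if best_name is None:
            break  # no remaining risk explains the leftover failures
        hypothesis.append(best_name)
        unexplained -= risk_groups[best_name]
    return hypothesis, unexplained


# Hypothetical topology: two fiber spans and one router.
groups = {
    "fiber_A": {"l1", "l2", "l3"},
    "fiber_B": {"l3", "l4"},
    "router_R1": {"l1", "l5"},
}
risks, leftover = localize(groups, {"l1", "l2", "l3"})
print(risks, leftover)
```

In this toy input the three failed links are all carried by the hypothetical span `fiber_A`, so a single optical-layer risk explains every IP-layer symptom. Failures that no fully failed group accounts for are returned as unexplained, which is one place where inconsistencies between the IP and optical databases could surface.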
