Probabilistic fault localization in communication systems using belief networks

We apply Bayesian reasoning techniques to perform fault localization in complex communication systems while using dynamic, ambiguous, uncertain, or incorrect information about the system structure and state. We introduce adaptations of two Bayesian reasoning techniques for polytrees: iterative belief updating and iterative most probable explanation. We show that these approximate schemes can be applied to belief networks of arbitrary shape and overcome the inherent exponential complexity associated with exact Bayesian reasoning. We show through simulation that our approximate schemes are almost optimally accurate, can identify multiple simultaneous faults in an event-driven manner, and incorporate both positive and negative information into the reasoning process. We show that fault localization through iterative belief updating is resilient to noise in the observed symptoms and prove that Bayesian reasoning can now be used in practice to provide effective fault localization.
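To make the role of positive and negative information concrete, the following is a minimal sketch of exact Bayesian fault diagnosis on a tiny two-layer belief network. It is not the approximate iterative schemes described above: it computes posteriors by brute-force enumeration, which is the exponential-cost baseline that such schemes are meant to avoid. The noisy-OR symptom model, the fault and symptom names, and all probability values are assumptions chosen for illustration only.

```python
from itertools import product

# Hypothetical prior fault probabilities (illustrative values only).
PRIOR = {"link_a": 0.01, "link_b": 0.02, "link_c": 0.01}

# Hypothetical causal strengths: P(symptom observed | only this fault is present).
CAUSES = {
    "path_1_down": {"link_a": 0.9, "link_b": 0.9},
    "path_2_down": {"link_b": 0.9, "link_c": 0.9},
}
LEAK = 0.001  # assumed probability of a spurious symptom when no fault is present


def symptom_prob(symptom, state):
    """Noisy-OR: the symptom is absent only if every active cause fails to trigger it."""
    p_absent = 1.0 - LEAK
    for fault, strength in CAUSES[symptom].items():
        if state[fault]:
            p_absent *= 1.0 - strength
    return 1.0 - p_absent


def posterior(observed):
    """Exact P(fault | observations) by enumerating all 2^n fault combinations."""
    faults = list(PRIOR)
    marginal = dict.fromkeys(faults, 0.0)
    total = 0.0
    for values in product([False, True], repeat=len(faults)):
        state = dict(zip(faults, values))
        # Prior probability of this combination of faults.
        weight = 1.0
        for f in faults:
            weight *= PRIOR[f] if state[f] else 1.0 - PRIOR[f]
        # Likelihood of each observation: True = symptom seen (positive),
        # False = symptom known to be absent (negative).
        for symptom, seen in observed.items():
            p = symptom_prob(symptom, state)
            weight *= p if seen else 1.0 - p
        total += weight
        for f in faults:
            if state[f]:
                marginal[f] += weight
    return {f: marginal[f] / total for f in faults}


positive_only = posterior({"path_1_down": True})
with_negative = posterior({"path_1_down": True, "path_2_down": False})
print(positive_only)
print(with_negative)
```

With only the positive symptom, `link_b` is the more probable culprit because of its higher prior. Adding the negative observation that `path_2_down` did not occur shifts most of the belief to `link_a`, since a failure of `link_b` would likely have triggered both symptoms. The loop over `2^n` fault states is what makes exact reasoning impractical for networks of realistic size.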
