End-to-end service failure diagnosis using belief networks

We present fault localization techniques for diagnosing end-to-end service problems in communication systems with complex topologies. We refine a layered system model that represents relationships between the services and functions offered between neighboring protocol layers. In a given layer, an end-to-end service between two hosts may be provided by multiple host-to-host services, each offered in this layer between two hosts on the end-to-end path. Relationships among end-to-end and host-to-host services form a bipartite probabilistic dependency graph whose structure depends on the network topology in the corresponding protocol layer. When an end-to-end service fails or experiences performance problems, it is important to find the responsible host-to-host services efficiently. However, finding the most probable explanation (MPE) of the observed symptoms is NP-hard. We propose two fault localization techniques based on Pearl's (1988) iterative algorithms for singly connected belief networks. The probabilistic dependency graph is transformed into a belief network; approximations based on Pearl's algorithms and an exact bucket tree elimination algorithm are then designed and evaluated through an extensive simulation study.
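To make the MPE problem concrete, the following is a minimal sketch of a bipartite dependency graph and an exhaustive MPE search over it. It is not the paper's method: it uses neither Pearl's iterative algorithms nor bucket tree elimination, and it only illustrates why enumeration grows as 2^n in the number of host-to-host services. The service names, the prior and edge probabilities, and the noisy-OR combination rule are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical host-to-host services with assumed prior failure probabilities.
PRIOR = {"h1": 0.01, "h2": 0.05, "h3": 0.02}

# Hypothetical bipartite dependency graph: each end-to-end service depends on
# the host-to-host services on its path. Each edge weight is the assumed
# probability that a failed host-to-host service breaks the end-to-end
# service (noisy-OR combination, an assumption of this sketch).
DEPENDS = {
    "e1": {"h1": 0.9, "h2": 0.8},
    "e2": {"h2": 0.8, "h3": 0.9},
    "e3": {"h3": 0.9},
}


def joint_probability(faults, observed):
    """Return P(fault assignment, observed symptoms) under noisy-OR."""
    p = 1.0
    for host, prior in PRIOR.items():
        p *= prior if faults[host] else 1.0 - prior
    for service, failed in observed.items():
        # Probability that the end-to-end service stays up given the faults.
        p_ok = 1.0
        for host, weight in DEPENDS[service].items():
            if faults[host]:
                p_ok *= 1.0 - weight
        p *= (1.0 - p_ok) if failed else p_ok
    return p


def most_probable_explanation(observed):
    """Enumerate all 2^n fault assignments and keep the most probable one."""
    hosts = sorted(PRIOR)
    best, best_p = None, -1.0
    for bits in product([False, True], repeat=len(hosts)):
        faults = dict(zip(hosts, bits))
        p = joint_probability(faults, observed)
        if p > best_p:
            best, best_p = faults, p
    return {h for h, is_faulty in best.items() if is_faulty}, best_p


if __name__ == "__main__":
    # e1 and e2 are observed to fail while e3 keeps working.
    print(most_probable_explanation({"e1": True, "e2": True, "e3": False}))
```

With these assumed numbers, the single fault h2 explains both failed services and is returned as the MPE. The exhaustive loop is only practical for a handful of services, which is the motivation for approximate or structure-exploiting inference.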

[1]  Malgorzata Steinder, et al. Increasing robustness of fault localization through analysis of lost, spurious, and positive symptoms, 2002, Proceedings. Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies.

[2]  Yossi A. Nygate, et al. Event correlation using rule and object based techniques, 1995, Integrated Network Management.

[3]  B. Dang, et al. Interconnections, second edition: bridges, routers, switches, and internetworking protocols [Bookshelf], 2000, IEEE Software.

[4]  Mariusz A. Fecko, et al. Combinatorial designs in multiple faults localization for battlefield networks, 2001, 2001 MILCOM Proceedings Communications for Network-Centric Operations: Creating the Information Force.

[5]  Judea Pearl, et al. Probabilistic reasoning in intelligent systems - networks of plausible inference, 1991, Morgan Kaufmann series in representation and reasoning.

[6]  Mischa Schwartz, et al. Schemes for fault identification in communication networks, 1995, TNET.

[7]  D. Ohsie, et al. High speed and robust event correlation, 1996, IEEE Commun. Mag.

[8]  Yves Raynaud, et al. Integrated Network Management IV, 1995, IFIP — The International Federation for Information Processing.

[9]  Ramesh Viswanathan, et al. A conceptual framework for network management event correlation and filtering systems, 1999, Integrated Network Management VI. Distributed Management for the Networked Millennium. Proceedings of the Sixth IFIP/IEEE International Symposium on Integrated Network Management.

[10]  Michael Luby, et al. Approximating Probabilistic Inference in Bayesian Belief Networks is NP-Hard, 1993, Artif. Intell.

[11]  Radia Perlman, et al. Interconnections: Bridges, Routers, Switches, and Internetworking Protocols, 1999.

[12]  Michael P. Wellman, et al. Bayesian networks, 1995, CACM.

[13]  Peng Wu, et al. Alarm correlation engine (ACE), 1998, NOMS 98 1998 IEEE Network Operations and Management Symposium.

[14]  Adarshpal S. Sethi, et al. Multi-layer Fault Localization Using Probabilistic Inference in Bipartite Dependency Graphs, 2001.

[15]  Rajeev Gopal, et al. Layered model for supporting fault isolation and recovery, 2000, NOMS 2000. 2000 IEEE/IFIP Network Operations and Management Symposium 'The Networked Planet: Management Beyond 2000'.

[16]  Robert H. Deng, et al. A Probabilistic Approach to Fault Diagnosis in Linear Lightwave Networks, 1993, IEEE J. Sel. Areas Commun.

[17]  Jung-Fu Cheng, et al. Turbo Decoding as an Instance of Pearl's "Belief Propagation" Algorithm, 1998, IEEE J. Sel. Areas Commun.

[18]  Malgorzata Steinder, et al. Non-deterministic diagnosis of end-to-end service failures in a multi-layer communication system, 2001, Proceedings Tenth International Conference on Computer Communications and Networks.

[19]  Salvatore J. Stolfo, et al. A coding approach to event correlation, 1995, Integrated Network Management.

[20]  Keith McCloghrie, et al. Definitions of Managed Objects for Bridges, 1991, RFC.

[21]  Rina Dechter, et al. Bucket elimination: A unifying framework for probabilistic inference, 1996, UAI.

[22]  Mischa Schwartz, et al. Identification of Faulty Links in Dynamic-Routed Networks, 1993, IEEE J. Sel. Areas Commun.

[23]  Irene Katzela, et al. Schemes for fault identification in communication networks, 1995.

[24]  Guangtian Liu, et al. Composite events for network event correlation, 1999, Integrated Network Management VI. Distributed Management for the Networked Millennium. Proceedings of the Sixth IFIP/IEEE International Symposium on Integrated Network Management.

[25]  A. Glavieux, et al. Near Shannon limit error-correcting coding and decoding: Turbo-codes. 1, 1993, Proceedings of ICC '93 - IEEE International Conference on Communications.

[26]  G. Jakobson, et al. Alarm correlation, 1993, IEEE Network.

[27]  Keith McCloghrie, et al. Definitions of Managed Objects for Bridges, 1993, RFC.

[28]  Rina Dechter, et al. A Scheme for Approximating Probabilistic Inference, 1997, UAI.

[29]  Martin Paterok, et al. Event Correlation in Heterogeneous Networks Using the OSI Management Framework, 1993, Integrated Network Management.

[30]  Keith McCloghrie, et al. Protocol Operations for version 2 of the Simple Network Management Protocol (SNMPv2), 1993, RFC.