Increasing robustness of fault localization through analysis of lost, spurious, and positive symptoms

This paper utilizes belief networks to implement fault localization in communication systems taking into account comprehensive information about the system behavior. Most previous work on this subject performs fault localization based solely on the information about malfunctioning system components (i.e., negative symptoms). We show that positive information, i.e., the lack of any disorder in some system components, may be used to improve the accuracy of this process. The technique presented allows lost and spurious symptoms to be incorporated in the analysis. We show through simulation that in a noisy network environment the analysis of lost and spurious symptoms increases the robustness of fault localization with belief networks. We also demonstrate that belief networks yield high accuracy even for approximate probability input data and therefore are a promising model for non-deterministic fault localization.

[1]  Yossi A. Nygate,et al.  Event correlation using rule and object based techniques , 1995, Integrated Network Management.

[2]  Mariusz A. Fecko,et al.  Combinatorial designs in multiple faults localization for battlefield networks , 2001, 2001 MILCOM Proceedings Communications for Network-Centric Operations: Creating the Information Force (Cat. No.01CH37277).

[3]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[4]  D. Ohsie,et al.  High speed and robust event correlation , 1996, IEEE Commun. Mag..

[5]  Malgorzata Steinder,et al.  Yemanja-a layered event correlation engine for multi-domain server farms , 2001, 2001 IEEE/IFIP International Symposium on Integrated Network Management Proceedings. Integrated Network Management VII. Integrated Management Strategies for the New Millennium (Cat. No.01EX470).

[6]  Ramesh Viswanathan,et al.  A conceptual framework for network management event correlation and filtering systems , 1999, Integrated Network Management VI. Distributed Management for the Networked Millennium. Proceedings of the Sixth IFIP/IEEE International Symposium on Integrated Network Management. (Cat. No.99EX302).

[7]  Malgorzata Steinder,et al.  Non-deterministic diagnosis of end-to-end service failures in a multi-layer communication system , 2001, Proceedings Tenth International Conference on Computer Communications and Networks (Cat. No.01EX495).

[8]  Rina Dechter,et al.  Bucket elimination: A unifying framework for probabilistic inference , 1996, UAI.

[9]  Keith McCloghrie,et al.  Protocol Operations for version 2 of the Simple Network Management Protocol (SNMPv2) , 1993, RFC.

[10]  Yves Raynaud,et al.  Integrated Network Management IV , 1995, IFIP — The International Federation for Information Processing.

[11]  Mischa Schwartz,et al.  Schemes for fault identification in communication networks , 1995, TNET.

[12]  Seraphin B. Calo,et al.  Alarm correlation and fault identification in communication networks , 1994, IEEE Trans. Commun..

[13]  B. Dang,et al.  Interconnections, second edition: bridges, routers, switches, and internetworking protocols [Bookshelf] , 2000, IEEE Software.

[14]  S. Katker A modeling framework for integrated distributed systems fault management , 1996 .

[15]  Mischa Schwartz,et al.  Identification of Faulty Links in Dynamic-Routed Networks , 1993, IEEE J. Sel. Areas Commun..

[16]  Guangtian Liu,et al.  Composite events for network event correlation , 1999, Integrated Network Management VI. Distributed Management for the Networked Millennium. Proceedings of the Sixth IFIP/IEEE International Symposium on Integrated Network Management. (Cat. No.99EX302).

[17]  Michael I. Jordan,et al.  Probabilistic Networks and Expert Systems , 1999 .

[18]  Michael Luby,et al.  Approximating Probabilistic Inference in Bayesian Belief Networks is NP-Hard , 1993, Artif. Intell..

[19]  Rajeev Gopal,et al.  Layered model for supporting fault isolation and recovery , 2000, NOMS 2000. 2000 IEEE/IFIP Network Operations and Management Symposium 'The Networked Planet: Management Beyond 2000' (Cat. No.00CB37074).

[20]  Robert H. Deng,et al.  A Probabilistic Approach to Fault Diagnosis in Linear Lightware Networks , 1993, IEEE J. Sel. Areas Commun..

[21]  David J. Spiegelhalter,et al.  Probabilistic Networks and Expert Systems , 1999, Information Science and Statistics.

[22]  Jung-Fu Cheng,et al.  Turbo Decoding as an Instance of Pearl's "Belief Propagation" Algorithm , 1998, IEEE J. Sel. Areas Commun..

[23]  Michael P. Wellman,et al.  Bayesian networks , 1995, CACM.

[24]  Peng Wu,et al.  Alarm correlation engine (ACE) , 1998, NOMS 98 1998 IEEE Network Operations and Management Symposium.

[25]  G. Jakobson,et al.  Alarm correlation , 1993, IEEE Network.

[26]  Salvatore J. Stolfo,et al.  A coding approach to event correlation , 1995, Integrated Network Management.

[27]  Bernard Pagurek,et al.  Network diagnosis by reasoning in uncertain nested evidence spaces , 1995, IEEE Trans. Commun..

[28]  Brendan J. Frey,et al.  Iterative Decoding of Compound Codes by Probability Propagation in Graphical Models , 1998, IEEE J. Sel. Areas Commun..

[29]  Malgorzata Steinder,et al.  The present and future of event correlation: A need for end-to-end service fault localization , 2001 .

[30]  Lundy M. Lewis,et al.  A Case-Based Reasoning Approach to the Resolution of Faults in Communication Networks , 1993, Integrated Network Management.

[31]  Malgorzata Steinder,et al.  End-to-end service failure diagnosis using belief networks , 2002, NOMS 2002. IEEE/IFIP Network Operations and Management Symposium. ' Management Solutions for the New Communications World'(Cat. No.02CH37327).

[32]  Martin Paterok,et al.  Event Correlation in Heterogeneous Networks Using the OSI Management Framework , 1993, Integrated Network Management.

[33]  Radia Perlman,et al.  Interconnections: Bridges, Routers, Switches, and Internetworking Protocols , 1999 .