Probabilistic fault diagnosis in communication systems through incremental hypothesis updating

This paper presents a probabilistic event-driven fault localization technique that uses a probabilistic symptom-fault map as its fault propagation model. The technique isolates the most probable set of faults through incremental updating of a symptom-explanation hypothesis. At any time, it provides a set of alternative hypotheses, each of which is a complete explanation of the set of symptoms observed thus far. The hypotheses are ranked according to a measure of their goodness. The technique allows multiple simultaneous independent faults to be identified and incorporates both negative and positive symptoms in the analysis. As shown in a simulation study, the technique offers close-to-optimal accuracy and is resilient both to noise in the symptom data and to inaccuracies in the probabilistic fault propagation model.
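The idea of incremental hypothesis updating can be illustrated with a minimal sketch. This is not the paper's algorithm: the fault and symptom names, the probabilities, the noisy-OR goodness measure, and the pruning of non-minimal hypotheses are all assumptions made for illustration, and only observed symptoms are modeled (the absence of a symptom is ignored).

```python
from math import prod

# Hypothetical symptom-fault map: CAUSES[fault][symptom] = P(symptom | fault).
CAUSES = {
    "link_a": {"s1": 0.9, "s2": 0.5},
    "link_b": {"s2": 0.8, "s3": 0.7},
    "router": {"s1": 0.4, "s3": 0.6},
}
# Hypothetical prior probability of each independent fault.
PRIOR = {"link_a": 0.01, "link_b": 0.02, "router": 0.005}


def explains(hypothesis, symptom):
    """A hypothesis explains a symptom if one of its faults can cause it."""
    return any(symptom in CAUSES[fault] for fault in hypothesis)


def update(hypotheses, symptom):
    """Extend every hypothesis so that it also explains the new symptom."""
    updated = set()
    for h in hypotheses:
        if explains(h, symptom):
            updated.add(h)  # already a complete explanation
        else:
            for fault, effects in CAUSES.items():
                if symptom in effects:
                    updated.add(h | {fault})
    # Assumed simplification: drop strict supersets of another hypothesis.
    return {h for h in updated if not any(other < h for other in updated)}


def goodness(hypothesis, observed):
    """Assumed measure: fault priors times a noisy-OR symptom likelihood."""
    prior = prod(PRIOR[fault] for fault in hypothesis)
    likelihood = prod(
        1.0 - prod(1.0 - CAUSES[fault].get(s, 0.0) for fault in hypothesis)
        for s in observed
    )
    return prior * likelihood


def diagnose(symptoms):
    """Process symptoms one at a time and return hypotheses, best first."""
    hypotheses = {frozenset()}  # the empty hypothesis explains nothing yet
    observed = []
    for s in symptoms:
        observed.append(s)
        hypotheses = update(hypotheses, s)
    return sorted(hypotheses, key=lambda h: goodness(h, observed), reverse=True)


ranked = diagnose(["s1", "s2"])
```

With these made-up numbers, `ranked` holds two complete explanations of `s1` and `s2`: the single fault `link_a` ranks first, ahead of the pair `router` and `link_b`, because the product of two small priors outweighs the pair's higher symptom likelihood.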
