Distributed Fault Localization in Hierarchically Routed Networks

Probabilistic inference has been shown to be effective in the non-deterministic diagnosis of end-to-end service failures. To overcome the exponential complexity of exact inference algorithms in fault propagation models represented by graphs with undirected loops, Pearl's iterative algorithms for polytrees were used as an approximation scheme. This approximation made it possible to diagnose end-to-end service failures in network topologies composed of tens of nodes. This paper proposes a distributed algorithm that increases the admissible network size by an order of magnitude. The algorithm divides the computational effort and system knowledge among multiple, hierarchically organized managers. The cooperation among managers is illustrated with examples, and the results of a preliminary performance study are presented.
