Multi-layer Fault Localization Using Probabilistic Inference in Bipartite Dependency Graphs

For the purpose of fault diagnosis, communication systems are frequently modeled in a layered fashion imitating the layered architecture of the modeled system. The layered model represents relationships between services, protocols, and functions offered between neighboring protocol layers. In a given layer, an end-to-end service between two hosts may be provided using multiple hop-to-hop services offered in this layer between two hosts on the end-to-end path. When an end-to-end service fails or experiences performance problems, it is critical to efficiently find the responsible hop-to-hop services. Dependencies between end-to-end and hop-to-hop services form a bipartite graph whose structure depends on the network topology in the corresponding protocol layer. To represent the uncertainty in the dependency graph, probabilities are assigned to its nodes and links. Finding the most probable explanation (MPE) of the observed symptoms in the probabilistic dependency graph is NP-hard. We transform the bipartite dependency graph to a belief network and investigate several algorithms for computing the MPE, such as bucket tree elimination and two approximations based on Pearl's iterative algorithms. We also introduce a novel algorithm using an iterative hypothesis update. These algorithms are implemented in Java, and their performance and accuracy are evaluated through an extensive simulation study.
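To make the MPE problem concrete, the following is a minimal sketch of exhaustive MPE search on a tiny bipartite dependency graph. It is not one of the algorithms described above and not the authors' Java implementation. It assumes a noisy-OR combination of causes, which is an illustrative modeling choice, and all names (`symptom_prob`, `most_probable_explanation`, the hop and service labels, the probabilities) are hypothetical.

```python
from itertools import product


def symptom_prob(active_faults, causes):
    """Noisy-OR (assumed model): probability that an end-to-end service
    fails, given the set of failed hop-to-hop services it depends on."""
    p_no_failure = 1.0
    for fault, strength in causes.items():
        if fault in active_faults:
            p_no_failure *= 1.0 - strength
    return 1.0 - p_no_failure


def most_probable_explanation(priors, links, observed):
    """Enumerate every fault assignment and keep the one that maximizes
    P(faults) * P(observed symptoms | faults).

    priors:   {hop: prior failure probability}        (node probabilities)
    links:    {service: {hop: causal probability}}    (link probabilities)
    observed: {service: True if a failure was observed, else False}
    """
    faults = sorted(priors)
    best, best_score = None, -1.0
    # 2^n assignments: feasible only for very small graphs.
    for bits in product([False, True], repeat=len(faults)):
        active = {f for f, failed in zip(faults, bits) if failed}
        score = 1.0
        for f in faults:
            score *= priors[f] if f in active else 1.0 - priors[f]
        for service, causes in links.items():
            p = symptom_prob(active, causes)
            score *= p if observed[service] else 1.0 - p
        if score > best_score:
            best, best_score = active, score
    return best, best_score


# Hypothetical three-hop path carrying three end-to-end services.
priors = {"h1": 0.01, "h2": 0.01, "h3": 0.01}
links = {
    "A": {"h1": 0.9, "h2": 0.9},  # service A traverses hops h1 and h2
    "B": {"h2": 0.9, "h3": 0.9},  # service B traverses hops h2 and h3
    "C": {"h3": 0.9},             # service C traverses hop h3 only
}
observed = {"A": True, "B": True, "C": False}

explanation, score = most_probable_explanation(priors, links, observed)
print(explanation, score)
```

In this example the single shared hop `h2` explains both observed failures while leaving service C healthy, so it outscores every multi-fault hypothesis. The search visits all 2^n fault assignments, so it is only workable for a handful of hops, which is why exact methods such as bucket elimination and approximate iterative schemes are of interest on realistic topologies.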
