The present and future of event correlation: A need for end-to-end service fault localization

Fault localization is the process of isolating the faults responsible for the observable malfunctioning of a managed system. Until recently, fault localization efforts concentrated mostly on diagnosing faults related to the availability of network resources in the lowest layers of the protocol stack. Modern enterprise environments require that fault diagnosis be performed in an integrated fashion across multiple layers of the protocol stack and that it include the diagnosis of performance problems. This paper reviews existing approaches to fault localization and presents the new facets of the problem revealed by the demands of modern enterprise systems. We also present end-to-end service failure diagnosis as a critical step towards multi-layer fault localization in an enterprise environment.
