Automatic alarm correlation for fault identification

In communication networks, a large number of alarms exist to signal abnormal network behavior. As a network fault typically results in a number of alarms, correlating these alarms and identifying their source is a major problem in fault management, and one of considerable practical significance. Uncorrelated alarms may not only lead to significant misdirected effort based on insufficient information, but may also cause multiple, possibly contradictory, corrective actions, as each alarm is handled independently. The paper proposes a general framework for solving the alarm correlation problem. The authors introduce a new model for faults and alarms based on probabilistic finite state machines and propose two algorithms. The first acquires the fault models from possibly incomplete and incorrect data. The second correlates alarms in the presence of multiple faults and noisy information. Both algorithms have polynomial time complexity, use an extension of the Viterbi algorithm to deal with corrupted data, and can be implemented in hardware. As an example, they are applied to analyse faults using data generated by the ANS (Advanced Network and Services, Inc.)/NSF T3 network.
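
For readers unfamiliar with the Viterbi algorithm mentioned above, the following is a minimal sketch of the standard algorithm, not the authors' extension for corrupted data. It recovers the most likely hidden state sequence from a sequence of observed alarms. The two-state model ("ok", "fault"), the observation symbols ("quiet", "alarm"), and all probabilities are hypothetical values chosen only for illustration; none of them come from the paper.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log_prob, path) of the most likely hidden state sequence.

    Standard Viterbi decoding in log space. All model parameters are
    supplied by the caller; nothing here is specific to the paper.
    """
    if not observations:
        raise ValueError("observations must not be empty")

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[s] = (log-probability, path) of the best path ending in state s
    best = {
        s: (log(start_p[s]) + log(emit_p[s][observations[0]]), [s])
        for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor state that maximises the path score.
            score, prev = max(
                (best[p][0] + log(trans_p[p][s]), p) for p in states
            )
            new_best[s] = (score + log(emit_p[s][obs]), best[prev][1] + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Hypothetical toy model: a component is either "ok" or in "fault".
STATES = ["ok", "fault"]
START_P = {"ok": 0.9, "fault": 0.1}
TRANS_P = {
    "ok": {"ok": 0.9, "fault": 0.1},
    "fault": {"ok": 0.2, "fault": 0.8},
}
# Alarms are noisy: a healthy component sometimes raises a spurious alarm,
# and a faulty one sometimes stays quiet.
EMIT_P = {
    "ok": {"quiet": 0.8, "alarm": 0.2},
    "fault": {"quiet": 0.1, "alarm": 0.9},
}

if __name__ == "__main__":
    log_p, path = viterbi(
        ["quiet", "alarm", "alarm", "alarm"], STATES, START_P, TRANS_P, EMIT_P
    )
    print(path)  # ['ok', 'fault', 'fault', 'fault']
```

With these illustrative numbers, a sustained run of alarms is decoded as a fault, while a single isolated alarm between quiet observations is decoded as noise on a healthy component. The running time is proportional to the sequence length times the square of the number of states, which is consistent with the polynomial complexity the abstract claims for the authors' own algorithms.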
