An alarm management framework for automated network fault identification

Many timing-constrained (or real-time) distributed systems, such as real-time database systems, are now used in safety-critical applications. However, they are subject to system failures caused by malfunctions of the underlying network components. Without the help of network experts or sophisticated management tools, most users cannot resolve these network problems by themselves. Worse, the use of such management tools, e.g. the 'ping' command, is often prohibited for security reasons. Accordingly, we develop a management system that automates network fault identification based solely on the analysis of abnormal events from the monitored timing-constrained distributed system. In this system, a fault identification framework automatically identifies faulty network elements by using a two-level fault propagation model that combines Timing Constraint Petri nets with an alarm clustering mechanism. In addition, the concepts of redundant/ringleader alarms and innocent network elements are introduced into the framework to obtain an effective diagnosis. Finally, the management system is implemented according to the framework to demonstrate the performance of our fault identification approach.
