Using Time over Threshold to Reduce Noise in Performance and Fault Management Systems

Fault management systems detect performance problems and intermittent failures by periodically examining a metric (such as the utilization of a link) and raising an alarm if the value is above a threshold. Such systems can generate numerous alarms. Various schemes have been proposed for reducing the number of alarms or for filtering them to isolate the important ones. The time over threshold detection algorithm instead reduces the volume of alarms at the source, in the detector itself. This paper describes an experiment that compares time over threshold against simple threshold crossings. The experiment demonstrates that time over threshold reduces the number of alarms raised by a factor of 25 without any significant reduction in the problems detected.
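The abstract does not spell out the detection rule, so the following is only a minimal sketch under one plausible reading: a simple detector raises an alarm for every sample above the threshold, while a time over threshold detector raises an alarm only once the metric has stayed above the threshold for a minimum number of consecutive samples. The function names, the `min_consecutive` parameter, and the sample values are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def simple_threshold_alarms(samples: Sequence[float], threshold: float) -> List[int]:
    """Return the index of every sample above the threshold (one alarm each)."""
    return [i for i, value in enumerate(samples) if value > threshold]


def time_over_threshold_alarms(
    samples: Sequence[float], threshold: float, min_consecutive: int
) -> List[int]:
    """Return the indices at which a run of samples above the threshold
    first reaches min_consecutive in length (assumed reading of the rule)."""
    alarms: List[int] = []
    run_length = 0
    for i, value in enumerate(samples):
        if value > threshold:
            run_length += 1
            # Alarm once per sustained run, at the moment it becomes long enough.
            if run_length == min_consecutive:
                alarms.append(i)
        else:
            # Any sample at or below the threshold ends the current run.
            run_length = 0
    return alarms


# Hypothetical link utilization samples (fraction of capacity), one per polling interval.
utilization = [0.50, 0.95, 0.40, 0.92, 0.93, 0.96, 0.97, 0.30, 0.91]

simple = simple_threshold_alarms(utilization, threshold=0.9)
sustained = time_over_threshold_alarms(utilization, threshold=0.9, min_consecutive=3)
print(len(simple), len(sustained))
```

In this toy series the isolated spikes are ignored and only the sustained excursion produces an alarm. The reduction seen here says nothing about the factor reported in the paper, which depends on the real traffic and the parameters used in the experiment.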
