Efficient fault diagnosis using incremental alarm correlation and active investigation for internet and overlay networks

Fault localization is a core element of fault management. A symptom-fault map is commonly used to describe symptom-fault causality in fault reasoning. For Internet service networks, a well-designed monitoring system can effectively correlate observable symptoms (i.e., alarms) with critical network faults (e.g., link failures). However, lost and spurious symptoms can significantly degrade the performance and accuracy of a passive fault localization system. For overlay networks, limited access to the underlying network, together with overlay scalability and dynamics, makes it impractical to build a static overlay symptom-fault map. In this paper, we first propose a novel active integrated fault reasoning (AIR) framework that incrementally incorporates active investigation actions into the passive fault reasoning process, based on an extended symptom-fault-action (SFA) model. Second, we propose an overlay network profile (ONP) to facilitate the dynamic creation of an overlay symptom-fault-action (O-SFA) model, so that the AIR framework can be applied seamlessly to overlay networks (O-AIR). We then elaborate the corresponding fault reasoning and action selection algorithms. Extensive simulations and Internet experiments show that AIR and O-AIR can significantly improve both the accuracy and the performance of fault reasoning for Internet and overlay service networks, especially when the ratio of lost and spurious symptoms is high.
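To make the idea of combining passive reasoning with active investigation concrete, the following is a minimal illustrative sketch in Python. It is not the AIR algorithm from the paper: the toy symptom-fault map, the greedy candidate selection, and the names FAULT_SYMPTOMS, diagnose, and probe are all assumptions introduced here for illustration. The sketch picks the fault that explains the most unexplained alarms, then uses an active action (probe) to check that fault's symptoms that were not observed, so that a lost symptom can be recovered and a wrong hypothesis rejected.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical toy symptom-fault map: each fault lists the symptoms it causes.
FAULT_SYMPTOMS: Dict[str, Set[str]] = {
    "link_A_down": {"s1", "s2"},
    "link_B_down": {"s2", "s3"},
    "router_C_down": {"s3", "s4"},
}


def diagnose(
    observed: Set[str],
    fault_symptoms: Dict[str, Set[str]],
    probe: Callable[[str], bool],
) -> Tuple[List[str], Set[str]]:
    """Return (confirmed faults, symptoms left unexplained).

    Passive step: greedily pick the candidate fault covering the most
    unexplained observed symptoms (ties broken by fault name).
    Active step: probe every symptom of that fault that was NOT observed;
    keep the fault only if all probes confirm the symptom is present.
    """
    unexplained = set(observed)
    candidates = dict(fault_symptoms)
    confirmed: List[str] = []
    while unexplained and candidates:
        best = max(
            sorted(candidates),
            key=lambda f: len(candidates[f] & unexplained),
        )
        if not candidates[best] & unexplained:
            break  # nothing explains the rest; it may be spurious
        symptoms = candidates.pop(best)
        missing = symptoms - set(observed)
        if all(probe(s) for s in sorted(missing)):
            confirmed.append(best)
            unexplained -= symptoms
    return confirmed, unexplained


if __name__ == "__main__":
    # Assumed ground truth: link B is down, and the alarm for s3 was lost.
    actual = {"s2", "s3"}
    print(diagnose({"s2"}, FAULT_SYMPTOMS, lambda s: s in actual))
```

In this toy setting, a purely passive reasoner that sees only s2 cannot distinguish between the two link faults. The probe on the unobserved symptom is what separates them, which is the intuition behind adding actions to the symptom-fault map. Symptoms that no candidate fault explains are returned as unexplained rather than forced into a hypothesis.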
