Towards Distributed and Adaptive Detection and Localisation of Network Faults

We present a statistical probing approach to distributed fault detection in networked systems, based on autonomous configuration of algorithm parameters. Statistical modelling is used to detect and localise network faults, and a detected fault is isolated to a node or link through collaborative fault localisation. From local measurements obtained by probing between nodes, probe response delay and packet drop are modelled for each link via parameter estimation. The estimated model parameters are then used to configure autonomously the algorithm parameters related to probe intervals and detection mechanisms. Expected fault-detection performance is specified as a cost rather than as specific parameter values, which significantly reduces configuration effort in a distributed system. Compared with methods that do not take observed network conditions into account, the algorithm offers fault detection with increased certainty, based on local measurements. We investigate the algorithm's performance for varying user parameters and failure conditions. The simulation results indicate that more than 95% of the generated faults can be detected with few false alarms, and that at least 80% of the link faults and 65% of the node faults are correctly localised. Performance can be improved by adjusting parameters and by using alternative paths for communicating algorithm control messages.
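As a rough illustration of the idea of deriving detection parameters from per-link estimates, the sketch below fits a delay model and a drop rate from local probe measurements and turns an acceptable false-alarm level into a probe count. It is not the algorithm evaluated here: the Gamma delay model, the method-of-moments fit, the independence of probe losses, and the names `estimate_link_model` and `probes_before_alarm` are all assumptions made for the example.

```python
import math
from statistics import mean, pvariance


def estimate_link_model(delays, probes_sent):
    """Estimate per-link parameters from local probe measurements.

    Assumption: response delays are modelled as Gamma distributed and
    fitted by the method of moments; the drop rate is the fraction of
    probes that received no response.
    """
    if probes_sent < len(delays) or len(delays) < 2:
        raise ValueError("need at least two responses and probes_sent >= responses")
    m = mean(delays)
    v = pvariance(delays)
    if m <= 0 or v <= 0:
        raise ValueError("delays must be positive and not all identical")
    shape = m * m / v          # Gamma shape k = mean^2 / variance
    scale = v / m              # Gamma scale theta = variance / mean
    drop_rate = 1.0 - len(delays) / probes_sent
    return shape, scale, drop_rate


def probes_before_alarm(drop_rate, max_false_alarm):
    """Smallest n such that drop_rate**n <= max_false_alarm.

    Assumption: probe losses on a healthy link are independent, so n
    consecutive losses occur by chance with probability drop_rate**n.
    """
    if not 0.0 < max_false_alarm < 1.0:
        raise ValueError("max_false_alarm must lie strictly between 0 and 1")
    if drop_rate >= 1.0:
        raise ValueError("drop_rate must be below 1")
    if drop_rate <= 0.0:
        return 1               # a lossless link: one missed probe suffices
    return max(1, math.ceil(math.log(max_false_alarm) / math.log(drop_rate)))


# Hypothetical measurements: three responses (in ms) to four probes.
shape, scale, drop_rate = estimate_link_model([10.0, 12.0, 14.0], probes_sent=4)
n_probes = probes_before_alarm(drop_rate, max_false_alarm=0.01)
```

The user supplies only the acceptable false-alarm level; the number of unanswered probes required before a fault is reported follows from the measured drop rate of each link, so lossier links automatically wait for more evidence.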
