The scheme design of distributed systems service fault management based on active probing

Service fault management in distributed computer systems and networks is a difficult task that requires highly efficient inference from massive amounts of data. In this paper, we propose a solution to this problem. First, the challenges of service fault management in distributed systems are analyzed, and a multilayer model is recommended. Then, a dependency matrix representing the causal relationship between faults and probes is defined, and the fault management framework is built. Next, a service fault management scheme using active probing is proposed. The scheme consists of two phases: fault detection and fault localization. In the first phase, we propose a probe selection algorithm that selects a minimal set of probes while retaining a high probability of fault detection. In the second phase, we propose a fault localization probe selection algorithm that selects additional probes to obtain more information about the system, based on the symptoms observed in the first phase. Finally, an example instance demonstrates the validity and efficiency of our scheme.
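
The two phases can be illustrated with a minimal sketch. It is not the algorithm from the paper, whose details the abstract does not give. It assumes a deterministic dependency matrix, a single active fault, a greedy set-cover heuristic for detection, and an even-split heuristic for localization. The probe names (p1 to p5), fault names (f1 to f4), and all function names are hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical dependency matrix, stored row-wise:
# DEPENDENCY[probe] is the set of faults that cause that probe to fail.
DEPENDENCY: Dict[str, Set[str]] = {
    "p1": {"f1", "f2"},
    "p2": {"f2", "f3"},
    "p3": {"f3", "f4"},
    "p4": {"f1"},
    "p5": {"f4"},
}


def select_detection_probes(dep: Dict[str, Set[str]]) -> List[str]:
    """Phase 1 (assumed heuristic): greedy set cover over the faults.

    Repeatedly pick the probe that covers the most still-uncovered faults,
    so that every fault makes at least one selected probe fail.
    Greedy selection gives a small set, not a guaranteed minimum.
    """
    uncovered: Set[str] = set().union(*dep.values())
    chosen: List[str] = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (first name wins).
        best = max(sorted(dep), key=lambda p: len(dep[p] & uncovered))
        if not dep[best] & uncovered:
            break  # remaining faults cannot be detected by any probe
        chosen.append(best)
        uncovered -= dep[best]
    return chosen


def candidate_faults(dep: Dict[str, Set[str]],
                     results: Dict[str, bool]) -> Set[str]:
    """Faults consistent with the observations (single-fault assumption).

    results[probe] is True if the probe failed, False if it succeeded.
    """
    candidates: Set[str] = set().union(*dep.values())
    for probe, failed in results.items():
        if failed:
            candidates &= dep[probe]   # the fault must affect this probe
        else:
            candidates -= dep[probe]   # the fault cannot affect this probe
    return candidates


def next_localization_probe(dep: Dict[str, Set[str]],
                            candidates: Set[str],
                            used: Set[str]) -> Optional[str]:
    """Phase 2 (assumed heuristic): pick the unused probe that splits the
    candidate set most evenly; return None if no probe separates it."""
    best, best_score = None, 0
    for probe in sorted(dep):
        if probe in used:
            continue
        hit = len(dep[probe] & candidates)
        score = min(hit, len(candidates) - hit)
        if score > best_score:
            best, best_score = probe, score
    return best


if __name__ == "__main__":
    detection = select_detection_probes(DEPENDENCY)
    # Simulate fault f2: a probe fails exactly when f2 is in its row.
    observed = {p: "f2" in DEPENDENCY[p] for p in detection}
    suspects = candidate_faults(DEPENDENCY, observed)
    follow_up = next_localization_probe(DEPENDENCY, suspects, set(detection))
    print(detection, sorted(suspects), follow_up)
```

In this toy setting two of the five probes are enough to detect every fault, and the localization step then requests only the probe that separates the remaining suspects. A probabilistic dependency matrix or multiple simultaneous faults would need a different consistency check than the set intersection used here.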
