Strategies for Problem Determination using Probing

As distributed systems continue to grow in size and complexity, scalable and cost-efficient techniques are needed for performing tasks such as problem determination and fault diagnosis. In this paper, we address these tasks using probes, or test transactions, which replace traditional “passive” event-correlation techniques with a more active, real-time information-gathering approach. We provide a theoretical foundation and a set of practical techniques for implementing efficient probing strategies. The main issue is minimizing the cost of probing while maximizing the diagnostic accuracy of the probe set. We show that finding an optimal probe set is NP-hard and devise polynomial-time approximation algorithms that demonstrate excellent empirical performance, even on large networks. We also implement an active, on-line probing strategy that yields a significant reduction in the probe set size.
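To make the probe-selection problem concrete, the following is a minimal sketch of one possible greedy heuristic. It is not taken from the paper, whose algorithms are not described in this abstract. It assumes at most one failed node at a time and assumes each probe is described by the set of nodes it traverses, so that a probe fails exactly when the faulty node lies on its path. The names `greedy_probe_set`, `split`, and `NO_FAULT` are hypothetical and introduced only for illustration.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical marker for the "nothing has failed" state (an assumption of this sketch).
NO_FAULT = "__no_fault__"


def split(partition: List[FrozenSet[str]], probe_nodes: Set[str]) -> List[FrozenSet[str]]:
    """Refine each group of indistinguishable states by one probe's outcome."""
    refined = []
    for group in partition:
        hit = group & probe_nodes    # states in which this probe would fail
        miss = group - probe_nodes   # states in which this probe would succeed
        refined.extend(g for g in (hit, miss) if g)
    return refined


def greedy_probe_set(probes: Dict[str, Set[str]], nodes: Set[str]) -> List[str]:
    """Greedily pick probes until every single-fault state is distinguishable.

    Assumes at most one faulty node at a time. The result is a heuristic
    selection, not a guaranteed minimum probe set.
    """
    # Initially all states (each node failing, or no fault) look the same.
    partition = [frozenset(nodes | {NO_FAULT})]
    chosen: List[str] = []
    remaining = dict(probes)
    while remaining and len(partition) < len(nodes) + 1:
        # Pick the probe whose outcome separates the states into the most groups;
        # sorting the names makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda name: len(split(partition, remaining[name])))
        refined = split(partition, remaining[best])
        if len(refined) == len(partition):
            break  # no remaining probe adds diagnostic information
        partition = refined
        chosen.append(best)
        del remaining[best]
    return chosen
```

For example, with the hypothetical probes `{"p1": {"a", "b"}, "p2": {"b", "c"}, "p3": {"a", "b", "c"}}` over nodes `{"a", "b", "c"}`, this sketch selects `["p1", "p2"]`: the outcomes of those two probes already give each of the four states (a, b, or c failed, or no fault) a distinct signature, so `p3` is not needed.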
