Adaptive diagnosis in distributed systems

Real-time problem diagnosis in large distributed computer systems and networks is a challenging task that requires fast and accurate inference from potentially huge volumes of data. In this paper, we propose a cost-efficient, adaptive diagnostic technique called active probing. Probes are end-to-end test transactions that collect information about the performance of a distributed system. Active probing combines probabilistic reasoning techniques with an information-theoretic approach, and allows fast online inference about the current system state by actively selecting only a small number of the most informative tests. We demonstrate empirically that the active probing scheme greatly reduces both the number of probes (by 60% to 75% in most of our real-life applications) and the time needed to localize the problem, compared with nonadaptive (preplanned) probing schemes. We also provide some theoretical results on the complexity of probe selection and on the effect of "noisy" probes on the accuracy of diagnosis. Finally, we discuss how to model the system's dynamics using dynamic Bayesian networks (DBNs) and an efficient approximate approach called sequential multifault; empirical results demonstrate a clear advantage of such approaches over "static" techniques that do not handle changes in the system.
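To make the idea of actively selecting the most informative probe concrete, here is a minimal sketch in Python. It is not the method of the paper, which relies on Bayesian networks and handles noisy probes. The sketch assumes at most one fault, noise-free probes, and a uniform prior. The node names, the probe names, the dependency sets, and the helpers `entropy`, `update`, `expected_entropy`, and `active_probing` are all hypothetical, introduced only for illustration. At each step the loop picks the probe with the lowest expected posterior entropy (equivalently, the highest information gain), runs it, and conditions the belief on the outcome.

```python
import math

def entropy(belief):
    """Shannon entropy (in bits) of a belief over system states."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def update(belief, path, failed):
    """Condition the belief on a noise-free probe outcome.

    A probe fails exactly when the faulty node lies on its path.
    """
    post = {s: p for s, p in belief.items() if (s in path) == failed}
    total = sum(post.values())
    if total == 0:
        raise ValueError("outcome is inconsistent with the current belief")
    return {s: p / total for s, p in post.items()}

def expected_entropy(belief, path):
    """Expected posterior entropy after running a probe over `path`."""
    p_fail = sum(p for s, p in belief.items() if s in path)
    p_pass = sum(p for s, p in belief.items() if s not in path)
    h = 0.0
    for failed, p_out in ((True, p_fail), (False, p_pass)):
        if p_out > 0:
            h += p_out * entropy(update(belief, path, failed))
    return h

def active_probing(probes, prior, run_probe, eps=1e-12):
    """Greedily run the most informative probe until the state is known."""
    belief, used = dict(prior), []
    while entropy(belief) > eps:
        candidates = [name for name in probes if name not in used]
        if not candidates:
            break
        best = min(candidates, key=lambda n: expected_entropy(belief, probes[n]))
        if expected_entropy(belief, probes[best]) >= entropy(belief) - eps:
            break  # no remaining probe is informative
        used.append(best)
        belief = update(belief, probes[best], run_probe(best))
    return belief, used

# Hypothetical dependency sets: the nodes each probe passes through.
probes = {
    "p1": {"A", "B"},
    "p2": {"B", "C"},
    "p3": {"C", "D"},
    "p4": {"A"},
    "p5": {"D"},
}
states = ["ok", "A", "B", "C", "D"]  # "ok" means no fault
prior = {s: 1 / len(states) for s in states}

true_fault = "C"  # simulated ground truth, unknown to the diagnoser
belief, used = active_probing(probes, prior, lambda name: true_fault in probes[name])
print(belief, used)
```

In this toy setting a preplanned scheme would send all five probes, whereas the adaptive loop stops as soon as the belief collapses to a single state. The size of the saving here is a property of the toy example only and says nothing about the figures reported in the abstract.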
