Failure diagnosis using decision trees

We present a decision tree learning approach to diagnosing failures in large Internet sites. We record runtime properties of each request and apply automated machine learning and data mining techniques to identify the causes of failures. We train decision trees on the request traces from time periods in which user-visible failures are present. Paths through the tree are ranked according to their degree of correlation with failure, and nodes are merged according to the observed partial order of system components. We evaluate this approach using actual failures from eBay and find that, among hundreds of potential causes, the algorithm successfully identifies 13 out of 14 true causes of failure, along with 2 false positives. We also discuss results from applying simplified decision trees to eBay's production site over several months. In addition, we give a cost-benefit analysis of manual vs. automated diagnosis systems. Our contributions include the statistical learning approach, the adaptation of decision trees to the context of failure diagnosis, and the deployment and evaluation of our tools on a high-volume production service.
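The core idea, training a tree on labeled request traces and then ranking the root-to-leaf paths that predict failure, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the feature names (`host`, `build`, `type`), the toy traces, the plain information-gain (ID3-style) splitting rule, and ranking by the number of failed requests a path covers. It does not reproduce the system described above, and it omits the node-merging step entirely.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of boolean failure labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_split(rows, labels, features):
    """Return the feature with the highest information gain, or None if no gain."""
    base = entropy(labels)
    best, best_gain = None, 1e-9
    for f in features:
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[f], []).append(y)
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        gain = base - remainder
        if gain > best_gain:
            best, best_gain = f, gain
    return best

def leaf_paths(rows, labels, features, path=()):
    """Grow the tree recursively; yield (path, n_failed, n_total) for each leaf."""
    f = best_split(rows, labels, features) if features else None
    if f is None:
        yield path, sum(labels), len(labels)
        return
    rest = [g for g in features if g != f]
    for value in sorted({row[f] for row in rows}):
        idx = [i for i, row in enumerate(rows) if row[f] == value]
        yield from leaf_paths([rows[i] for i in idx], [labels[i] for i in idx],
                              rest, path + ((f, value),))

def rank_paths(rows, labels, features):
    """Keep leaves where most requests failed; rank by failed requests covered."""
    failing = [(p, bad, n) for p, bad, n in leaf_paths(rows, labels, features)
               if 2 * bad > n]
    return sorted(failing, key=lambda leaf: leaf[1], reverse=True)

# Hypothetical traces: requests fail only on host h2 running build v2.
rows, labels = [], []
for host in ("h1", "h2"):
    for build in ("v1", "v2"):
        for rtype in ("search", "bid"):
            rows.append({"host": host, "build": build, "type": rtype})
            labels.append(host == "h2" and build == "v2")

ranked = rank_paths(rows, labels, ["host", "build", "type"])
for path, failed, total in ranked:
    print(path, failed, total)
```

On this toy data the only failing leaf is the path combining host h2 with build v2, which covers both failed requests. A production setting would need a more robust learner, noise handling, and a ranking criterion chosen to match the correlation measure actually used.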
