Performance problems diagnosis in cloud computing systems by mining request trace logs

In cloud computing systems, end-to-end request tracing helps developers understand the runtime behavior of user requests. Based on trace logs, we propose an approach to localize the abnormal methods that are the primary causes of performance problems. Our approach involves three steps: (1) cluster the user requests into categories according to their call sequences and select the major categories; (2) extract the principal methods that might be causing performance degradation; (3) identify the abnormal methods among the principal methods in each major category. We validate our approach on four cases of performance degradation on a real-world enterprise-class cloud computing platform. The experimental results show that our approach can locate the prime causes of performance problems with low false-positive and false-negative rates.
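The three steps can be illustrated with a minimal sketch. The abstract does not specify the trace format, the selection criteria, or the statistical tests, so everything concrete here is an assumption: a trace is taken to be a list of (method, latency) pairs, principal methods are approximated by ranking latency variance, and abnormal methods by a median-ratio check against a baseline window. The function names and thresholds (`min_share`, `top_k`, `ratio`) are hypothetical stand-ins, not the authors' method.

```python
from collections import defaultdict
from statistics import median, pvariance

# Assumed trace format: a list of (method_name, latency_ms) pairs in call order.

def cluster_by_call_sequence(traces):
    """Step 1a: group traces whose method call sequences are identical."""
    clusters = defaultdict(list)
    for trace in traces:
        key = tuple(method for method, _ in trace)
        clusters[key].append(trace)
    return clusters

def major_categories(clusters, min_share=0.2):
    """Step 1b: keep categories holding at least min_share of all traces."""
    total = sum(len(group) for group in clusters.values())
    return {k: g for k, g in clusters.items() if len(g) / total >= min_share}

def latencies_by_method(group):
    """Collect per-method latency samples across the traces of one category."""
    samples = defaultdict(list)
    for trace in group:
        for method, latency in trace:
            samples[method].append(latency)
    return samples

def principal_methods(group, top_k=2):
    """Step 2 (stand-in heuristic): rank methods by latency variance."""
    samples = latencies_by_method(group)
    ranked = sorted(samples, key=lambda m: pvariance(samples[m]), reverse=True)
    return ranked[:top_k]

def abnormal_methods(baseline_group, current_group, candidates, ratio=1.5):
    """Step 3 (stand-in heuristic): flag candidates whose median latency
    in the current window is at least `ratio` times the baseline median."""
    base = latencies_by_method(baseline_group)
    cur = latencies_by_method(current_group)
    return [m for m in candidates
            if m in base and m in cur
            and median(cur[m]) >= ratio * median(base[m])]

# Synthetic example data (made up for illustration only).
baseline = [[("auth", 5), ("query", 20), ("render", 8)] for _ in range(4)]
baseline.append([("auth", 5), ("upload", 40)])
current = [[("auth", 5), ("query", 90), ("render", 9)] for _ in range(4)]
current.append([("auth", 6), ("upload", 41)])

base_clusters = cluster_by_call_sequence(baseline)
cur_clusters = cluster_by_call_sequence(current)
for key, cur_group in major_categories(cur_clusters, min_share=0.5).items():
    base_group = base_clusters.get(key, [])
    candidates = principal_methods(base_group + cur_group)
    print(key, abnormal_methods(base_group, cur_group, candidates))
```

On this synthetic data only the `auth, query, render` category passes the 50% share threshold, and `query` is the only method flagged. A real deployment would likely replace both heuristics with proper statistical techniques and would need a more tolerant notion of call-sequence similarity than exact equality.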
