Performance problems online detection in cloud computing systems via analyzing request execution paths

Detecting performance problems online in large-scale cloud computing systems is a persistent headache for developers. The behavior of, and the hidden connections among, the large number of runtime request execution paths in such systems usually carry useful information for detecting these problems. In this paper, we propose an approach to rapidly diagnose the source of performance degradation in large-scale non-stop cloud computing systems. The approach first groups user requests into categories with a fast clustering algorithm, then applies principal component analysis to extract the primary methods, and finally compares the normal and abnormal behaviors of the primary methods to localize the main cause of performance problems. We conduct extensive experiments on a real-world enterprise system that provides services to the public. The results show that our approach can locate the prime causes of performance problems accurately and efficiently.
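The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the clustering algorithm, so exact grouping by call sequence stands in for it here, and the function names, the `var_kept` threshold, and the latency-ratio ranking are all assumptions made for illustration. Each row of a latency matrix is assumed to be one request and each column one method.

```python
import numpy as np
from collections import defaultdict


def group_by_path(paths):
    """Step 1 (stand-in for the paper's fast clustering, which is unspecified):
    group request indices by the exact sequence of methods they invoked."""
    groups = defaultdict(list)
    for i, path in enumerate(paths):
        groups[tuple(path)].append(i)
    return dict(groups)


def primary_methods(latency, var_kept=0.9):
    """Step 2: run PCA (via SVD) on a requests-by-methods latency matrix and
    return the method with the largest loading on each retained component.
    var_kept is a hypothetical threshold on cumulative explained variance."""
    centered = latency - latency.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    total = np.sum(s ** 2)
    if total == 0:
        return []  # no variance at all, nothing stands out
    explained = np.cumsum(s ** 2) / total
    k = min(int(np.searchsorted(explained, var_kept)) + 1, len(s))
    return sorted({int(np.argmax(np.abs(vt[j]))) for j in range(k)})


def rank_suspects(normal, abnormal, methods):
    """Step 3: compare normal and abnormal behavior of the primary methods,
    ranking them by the ratio of mean latencies (an assumed comparison rule)."""
    base = normal[:, methods].mean(axis=0)
    now = abnormal[:, methods].mean(axis=0)
    ratio = now / base
    order = np.argsort(-ratio)
    return [(methods[i], float(ratio[i])) for i in order]
```

In use, requests would first be split with `group_by_path`, and within each group `primary_methods` would be run on the combined normal and abnormal latency rows before `rank_suspects` orders the candidates. Restricting the comparison to the primary methods keeps the final step cheap, which matches the abstract's emphasis on rapid diagnosis.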
