Magnifier: Online Detection of Performance Problems in Large-Scale Cloud Computing Systems

In large-scale cloud computing systems, even a simple user request may pass through numerous services deployed on different physical machines. As a result, localizing the root causes of performance degradation online in such systems is a great challenge. Existing end-to-end request tracing approaches are not suitable for online anomaly detection because their time complexity is exponential in the size of the trace logs. In this paper, we propose an approach, namely Magnifier, to rapidly diagnose the source of performance degradation in large-scale non-stop cloud systems. Magnifier models the execution path graph of a user request as a hierarchical structure with a component layer, a module layer, and a function layer, and detects anomalies in each layer separately, from higher layers to lower ones. In each layer, every node is assigned a newly created local identifier in addition to the global identifier of the request, which significantly reduces the volume of trace logs that must be parsed and accelerates the anomaly detection process. We conduct extensive experiments on a real-world enterprise system, the Alibaba cloud computing platform, which provides services to the public. The results show that Magnifier locates the root causes of performance degradation more accurately and efficiently than existing approaches.
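The layered detection idea in the abstract can be illustrated with a minimal sketch. All names here (event fields, the latency-threshold detector, the layer list) are illustrative assumptions, not the paper's actual interface: each trace event carries the request's global identifier implicitly plus a layer-local identifier, so each layer can be scanned independently, and lower layers are only examined under nodes already flagged as anomalous above them.

```python
from collections import defaultdict

# Hypothetical sketch of layered, top-down anomaly localization
# (field names and the threshold-based detector are assumptions for
# illustration; the paper's statistical detector is not reproduced here).
LAYERS = ["component", "module", "function"]

def detect_anomalies(events, thresholds):
    """events: dicts with keys layer, local_id, parent_local_id, latency_ms.
       thresholds: per-layer latency threshold (stand-in detector).
       Returns the set of (layer, local_id) pairs flagged as anomalous,
       found top-down with pruning."""
    by_layer = defaultdict(list)
    for e in events:
        by_layer[e["layer"]].append(e)

    anomalous = set()
    suspects = None  # None means "check every node" at the top layer
    for layer in LAYERS:
        next_suspects = set()
        for e in by_layer[layer]:
            # Prune: only descend into children of anomalous parents.
            if suspects is not None and e["parent_local_id"] not in suspects:
                continue
            if e["latency_ms"] > thresholds[layer]:
                anomalous.add((layer, e["local_id"]))
                next_suspects.add(e["local_id"])
        suspects = next_suspects
        if not suspects:
            break  # nothing anomalous here; no need to parse lower layers
    return anomalous
```

Because each layer is keyed by its own local identifiers, only the slices of the trace under suspect nodes are ever parsed, which is the source of the claimed speedup over whole-trace analysis.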
