An online service-oriented performance profiling tool for cloud computing systems

The growing scale and complexity of component interactions in cloud computing systems pose great challenges for operators trying to understand system performance characteristics. Profiling has long proved to be an effective approach to performance analysis; however, existing approaches face new challenges in cloud computing systems. First, the efficiency of profiling becomes a critical concern; second, profiling should be service-oriented to support separation-of-concerns performance analysis. To address these issues, we present P-Tracer, an online performance profiling tool tailored for cloud computing systems. First, P-Tracer builds a dedicated search engine that proactively processes performance logs and generates an index for fast queries; second, for each service, P-Tracer derives statistical insight into performance characteristics along multiple dimensions and provides operators with a suite of web-based interfaces for querying critical information. We evaluate P-Tracer in terms of tracing overhead, data preprocessing scalability, and querying efficiency. Three real-world case studies from the Alibaba cloud computing platform demonstrate that P-Tracer can help operators understand software behavior and localize the primary causes of performance anomalies effectively and efficiently.
