CLUE: System trace analytics for cloud service performance diagnosis

In this paper, we present CLUE, a system event analytics tool for black-box performance diagnosis in production cloud computing systems. CLUE provides a unified and extensible means of profiling the transactional behavior of services, and builds structured data called event sketches. CLUE further offers a set of analytic tools for summarizing and analyzing event sketches by integrating data mining and statistical analysis. CLUE has been developed at NEC as an internal tool and applied to diagnose a diverse set of real performance problems in multi-tiered IT applications running on multi-core servers across major platforms, including Linux (Red Hat, Fedora), Unix (HP-UX), and Windows (Windows Server 2008). We evaluate our framework on real-world IT systems and show how it provides visibility into service performance problems and enables their effective diagnosis.
