IntroPerf: transparent context-sensitive multi-layer performance inference using system stack traces

Performance bugs are frequently observed in commodity software. While profilers or source code-based tools can be used at development stage where a program is diagnosed in a well-defined environment, many performance bugs survive such a stage and affect production runs. OS kernel-level tracers are commonly used in post-development diagnosis due to their independence from programs and libraries; however, they lack detailed program-specific metrics to reason about performance problems such as function latencies and program contexts. In this paper, we propose a novel performance inference system, called IntroPerf, that generates fine-grained performance information -- like that from application profiling tools -- transparently by leveraging OS tracers that are widely available in most commodity operating systems. With system stack traces as input, IntroPerf enables transparent context-sensitive performance inference, and diagnoses application performance in a multi-layered scope ranging from user functions to the kernel. Evaluated with various performance bugs in multiple open source software projects, IntroPerf automatically ranks potential internal and external root causes of performance bugs with high accuracy without any prior knowledge about or instrumentation on the subject software. Our results show IntroPerf's effectiveness as a lightweight performance introspection tool for post-development diagnosis.

[1]  Susan L. Graham,et al.  Gprof: A call graph execution profiler , 1982, SIGPLAN '82.

[2]  James R. Larus,et al.  Exploiting hardware performance counters with flow and context sensitive profiling , 1997, PLDI '97.

[3]  Andrew A. Chien,et al.  Coordinated thread scheduling for workstation clusters under windows NT , 1997 .

[4]  Richard J. Moore A Universal Dynamic Trace for Linux and Other Operating Systems , 2001, USENIX Annual Technical Conference, FREENIX Track.

[5]  Eric A. Brewer,et al.  Pinpoint: problem determination in large, dynamic Internet services , 2002, Proceedings International Conference on Dependable Systems and Networks.

[6]  Marcos K. Aguilera,et al.  Performance debugging for distributed systems of black boxes , 2003, SOSP '03.

[7]  Richard Mortier,et al.  Using Magpie for Request Extraction and Workload Modelling , 2004, OSDI.

[8]  Bryan Cantrill,et al.  Dynamic Instrumentation of Production Systems , 2004, USENIX Annual Technical Conference, General Track.

[9]  David E. Culler,et al.  The ganglia distributed monitoring system: design, implementation, and experience , 2004, Parallel Comput..

[10]  Michael I. Jordan,et al.  Scalable statistical bug isolation , 2005, PLDI '05.

[11]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[12]  Jong-Deok Choi,et al.  Accurate, efficient, and adaptive calling context profiling , 2006, PLDI '06.

[13]  Amin Vahdat,et al.  Pip: Detecting the Unexpected in Distributed Systems , 2006, NSDI.

[14]  Nicholas Nethercote,et al.  Valgrind: a framework for heavyweight dynamic binary instrumentation , 2007, PLDI '07.

[15]  Michael D. Bond,et al.  Probabilistic calling context , 2007, OOPSLA.

[16]  Chun Zhang,et al.  vPath: Precise Discovery of Request Processing Paths from Black-Box Observations of Thread and Network Activities , 2009, USENIX Annual Technical Conference.

[17]  Trishul M. Chilimbi,et al.  HOLMES: Effective statistical debugging via efficient path profiling , 2009, 2009 IEEE 31st International Conference on Software Engineering.

[18]  Donald Beaver,et al.  Dapper, a Large-Scale Distributed Systems Tracing Infrastructure , 2010 .

[19]  Brad Chen,et al.  Locating System Problems Using Dynamic Instrumentation , 2010 .

[20]  Ding Yuan,et al.  SherLog: error diagnosis by connecting clues from run-time logs , 2010, ASPLOS 2010.

[21]  Michael D. Bond,et al.  Breadcrumbs: efficient context sensitivity for dynamic bug detection analyses , 2010, PLDI '10.

[22]  Ding Yuan,et al.  SherLog: error diagnosis by connecting clues from run-time logs , 2010, ASPLOS XV.

[23]  Úlfar Erlingsson,et al.  Fay: extensible distributed tracing from kernels to clusters , 2011, SOSP '11.

[24]  Ding Yuan,et al.  Improving Software Diagnosability via Log Enhancement , 2012, TOCS.

[25]  Dongmei Zhang,et al.  Performance debugging in the large via mining millions of stack traces , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[26]  Shan Lu,et al.  Understanding and detecting real-world performance bugs , 2012, PLDI.

[27]  Xiangyu Zhang,et al.  Precise Calling Context Encoding , 2012, IEEE Trans. Software Eng..

[28]  Dongmei Zhang,et al.  Context-sensitive delta inference for identifying workload-dependent performance bottlenecks , 2013, ISSTA.