Non-Intrusive Performance Profiling for Entire Software Stacks Based on the Flow Reconstruction Principle

Understanding the performance behavior of distributed server stacks at scale is non-trivial. Servicing a single request can trigger numerous sub-requests across heterogeneous software components, and many similar requests are serviced concurrently and in parallel. When a user experiences poor performance, it is extremely difficult to identify the root cause, as well as the software components and machines that are the culprits. This paper describes Stitch, a non-intrusive tool capable of profiling the performance of an entire distributed software stack solely using the unstructured logs output by heterogeneous software components. Stitch differs substantially from all prior related tools in that it can construct a system model of an entire software stack without any domain knowledge being built into the tool itself. Instead, it automatically reconstructs the extensive domain knowledge of the programmers who wrote the code; it does this by relying on the Flow Reconstruction Principle, which states that programmers log events such that one can reliably reconstruct the execution flow a posteriori.
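To make the principle concrete, below is a minimal sketch (not Stitch's actual implementation) of how events in unstructured logs can be grouped into per-request flows by the identifiers they share. The log lines, the identifier patterns, and the helper names are illustrative assumptions; a real tool would infer identifier formats rather than hard-code them.

```python
import re
from collections import defaultdict

# Hypothetical log lines from several components of a stack; the
# identifier formats (req_*, blk_*) are illustrative assumptions.
LOG_LINES = [
    "2016-03-01 10:00:01 frontend  INFO accepted request req_7f3a",
    "2016-03-01 10:00:02 scheduler INFO req_7f3a assigned to worker w2",
    "2016-03-01 10:00:02 frontend  INFO accepted request req_9c1d",
    "2016-03-01 10:00:05 worker-w2 INFO req_7f3a wrote block blk_42",
    "2016-03-01 10:00:06 storage   INFO blk_42 replicated to node n5",
    "2016-03-01 10:00:07 frontend  INFO req_7f3a completed",
]

# Tokens that look like system-generated identifiers (request IDs,
# block IDs); hard-coded here purely for the sake of the sketch.
ID_PATTERN = re.compile(r"\b(?:req|blk)_[0-9a-f]+\b")

def reconstruct_flows(lines):
    """Group log events into flows via shared identifiers (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    events = []
    for line in lines:
        ids = ID_PATTERN.findall(line)
        if not ids:
            continue  # event carries no identifier; not attributable
        events.append((line, ids))
        # Identifiers co-occurring on one line are linked transitively,
        # so req_7f3a and blk_42 end up in the same flow.
        for other in ids[1:]:
            union(ids[0], other)

    flows = defaultdict(list)
    for line, ids in events:
        flows[find(ids[0])].append(line)
    return flows

for root, flow in reconstruct_flows(LOG_LINES).items():
    print(f"--- flow containing {root} ---")
    for event in flow:  # input order preserves timestamp order here
        print(" ", event)
```

Running this groups five of the six events into one flow (the request that fans out into a block write and replication) and isolates req_9c1d as its own flow. The union-find step captures the key idea: a request ID and a block ID that appear together on even a single log line are enough to stitch two components' events into one end-to-end flow.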
