Performance debugging for distributed systems of black boxes

Many interesting large-scale systems are distributed systems of multiple communicating components. Such systems can be very hard to debug, especially when they exhibit poor performance. The problem becomes much harder when systems are composed of "black-box" components: software from many different (perhaps competing) vendors, usually without source code available. Typical solutions-provider employees are not always skilled or experienced enough to debug these systems efficiently. Our goal is to design tools that enable modestly-skilled programmers (and experts, too) to isolate performance bottlenecks in distributed systems composed of black-box nodes.We approach this problem by obtaining message-level traces of system activity, as passively as possible and without any knowledge of node internals or message semantics. We have developed two very different algorithms for inferring the dominant causal paths through a distributed system from these traces. One uses timing information from RPC messages to infer inter-call causality; the other uses signal-processing techniques. Our algorithms can ascribe delay to specific nodes on specific causal paths. Unlike previous approaches to similar problems, our approach requires no modifications to applications, middleware, or messages.

[1]  Susan L. Graham,et al.  Gprof: A call graph execution profiler , 1982, SIGPLAN '82.

[2]  B.P. Miller DPM: A Measurement System for Distributed Programs , 1988, IEEE Trans. Computers.

[3]  Barton P. Miller,et al.  Critical path analysis for the execution of parallel and distributed programs , 1988, [1988] Proceedings. The 8th International Conference on Distributed.

[4]  José Oncina,et al.  Learning Stochastic Regular Grammars by Means of a State Merging Method , 1994, ICGI.

[5]  Jerome A. Rolia,et al.  Automatic generation of a software performance model using an object-oriented prototype , 1995, MASCOTS '95. Proceedings of the Third International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems.

[6]  David L. Mills,et al.  The Network Computer as Precision Timekeeper , 1996 .

[7]  Vern Paxson,et al.  Automated packet trace analysis of TCP implementations , 1997, SIGCOMM '97.

[8]  William E. Johnston,et al.  The NetLogger methodology for high performance distributed systems performance analysis , 1998, Proceedings. The Seventh International Symposium on High Performance Distributed Computing (Cat. No.98TB100244).

[9]  Joseph L. Hellerstein,et al.  ETE: a customizable approach to measuring end-to-end response times and their components in distributed systems , 1999, Proceedings. 19th IEEE International Conference on Distributed Computing Systems (Cat. No.99CB37003).

[10]  Emden R. Gansner,et al.  An open graph visualization system and its applications to software engineering , 2000, Softw. Pract. Exp..

[11]  Emden R. Gansner,et al.  An open graph visualization system and its applications to software engineering , 2000 .

[12]  Anja Feldmann BLT: Bi-Layer Tracing of HTTP and TCP/IP , 2000, Comput. Networks.

[13]  Yin Zhang,et al.  Detecting Stepping Stones , 2000, USENIX Security Symposium.

[14]  Anja Feldmann,et al.  A non-instrusive, wavelet-based approach to detecting network performance problems , 2001, IMW '01.

[15]  Aaron B. Brown,et al.  An active approach to characterizing dynamic dependencies for problem determination in a distributed environment , 2001, 2001 IEEE/IFIP International Symposium on Integrated Network Management Proceedings. Integrated Network Management VII. Integrated Management Strategies for the New Millennium (Cat. No.01EX470).

[16]  Saurabh Bagchi,et al.  Dependency Analysis in Distributed Systems using Fault Injection: Application to Problem Determination in an e-commerce Environment , 2001, DSOM.

[17]  Eric A. Brewer,et al.  Pinpoint: problem determination in large, dynamic Internet services , 2002, Proceedings International Conference on Dependable Systems and Networks.

[18]  Rebecca Isaacs,et al.  Performance analysis in loosely-coupled distributed systems , 2002 .

[19]  Emre Kiciman JBoss request-tracing in Pinpoint , 2003 .

[20]  Eric A. Brewer,et al.  Using Runtime Paths for Macroanalysis , 2003, HotOS.