SysProf: Online Distributed Behavior Diagnosis through Fine-grain System Monitoring

Runtime monitoring is key to the effective management of enterprise and high performance applications. To deal with the complex behaviors of today’s multi-tier applications running across shared platforms, such monitoring must meet three criteria: (1) fine granularity, including being able to track the resource usage of specific application behaviors like individual client-server interactions, (2) real-time response, referring to the monitoring system’s ability to both capture and analyze currently needed monitoring information with the delays required for online management, and (3) enterprise-wide operation, which means that the monitoring information captured and analyzed must span across the entire software stack and set of machines involved in request generation, request forwarding, service provision, and return. This paper presents the SysProf system-level monitoring toolkit, which provides a flexible, low overhead framework for enterprise-wide monitoring. The toolkit permits the capture of monitoring information at different levels of granularity, ranging from tracking the system-level activities triggered by a single system call, to capturing the client-server interactions associated with certain request classes, to characterizing the server resources consumed by sets of clients or client behaviors. The paper demonstrates the efficacy of SysProf by using it to manage two different enterprise applications: (1) detecting performance bottlenecks in a high performance shared network file service, and (2) enforcing service level agreements in a multi-tier auctioning web site.

[1]  Richard Mortier,et al.  Using Magpie for Request Extraction and Workload Modelling , 2004, OSDI.

[2]  Marcos K. Aguilera,et al.  Performance debugging for distributed systems of black boxes , 2003, SOSP '03.

[3]  Kang G. Shin,et al.  Stateful distributed interposition , 2004, TOCS.

[4]  Willy Zwaenepoel,et al.  Performance and scalability of EJB applications , 2002, OOPSLA '02.

[5]  Greg Eisenhauer Dynamic Code Generation with the E-Code Language , 2005 .

[6]  Eric A. Brewer,et al.  Pinpoint: problem determination in large, dynamic Internet services , 2002, Proceedings International Conference on Dependable Systems and Networks.

[7]  David Mosberger,et al.  httperf—a tool for measuring web server performance , 1998, PERV.

[8]  Peter Druschel,et al.  Resource containers: a new facility for resource management in server systems , 1999, OSDI '99.

[9]  Wu-chun Feng,et al.  MAGNET: a tool for debugging, analyzing and adapting computing systems , 2003, CCGrid 2003. 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003. Proceedings..

[10]  Ami Marowka,et al.  The GRID: Blueprint for a New Computing Infrastructure , 2000, Parallel Distributed Comput. Pract..

[11]  Barton P. Miller,et al.  Fine-grained dynamic instrumentation of commodity operating system kernels , 1999, OSDI '99.

[12]  Karsten Schwan,et al.  Window-Constrained Process Scheduling for Linux Systems , 2002 .

[13]  Allen D. Malony,et al.  Performance Measurement Intrusion and Perturbation Analysis , 1992, IEEE Trans. Parallel Distributed Syst..

[14]  Bryan Cantrill,et al.  Dynamic Instrumentation of Production Systems , 2004, USENIX Annual Technical Conference, General Track.

[15]  Joseph L. Hellerstein,et al.  ETE: a customizable approach to measuring end-to-end response times and their components in distributed systems , 1999, Proceedings. 19th IEEE International Conference on Distributed Computing Systems (Cat. No.99CB37003).

[16]  Ian Foster,et al.  The Grid 2 - Blueprint for a New Computing Infrastructure, Second Edition , 1998, The Grid 2, 2nd Edition.

[17]  Jeffrey S. Vetter,et al.  Dynamic statistical profiling of communication activity in distributed applications , 2002, SIGMETRICS '02.

[18]  David Sinreich,et al.  An architectural blueprint for autonomic computing , 2006 .

[19]  Christian Poellabauer,et al.  Resource-aware stream management with the customizable dproc distributed monitoring mechanisms , 2003, High Performance Distributed Computing, 2003. Proceedings. 12th IEEE International Symposium on.

[20]  Amin Vahdat,et al.  Interposed request routing for scalable network storage , 2000, TOCS.

[21]  Michel Dagenais,et al.  Measuring and Characterizing System Behavior Using Kernel-Level Event Logging , 2000, USENIX Annual Technical Conference, General Track.

[22]  Gustavo Alonso,et al.  Web Services: Concepts, Architectures and Applications , 2009 .

[23]  Alan L. Cox,et al.  Causeway: Support for Controlling and Analyzing the Execution of Web-Accessible Applications , 2005 .

[24]  Jeffrey C. Mogul,et al.  Operating Systems Should Support Business Change , 2005, HotOS.

[25]  Richard J. Moore A Universal Dynamic Trace for Linux and Other Operating Systems , 2001, USENIX Annual Technical Conference, FREENIX Track.

[26]  Margo I. Seltzer,et al.  Improving interactive performance using TIPME , 2000, SIGMETRICS '00.

[27]  David E. Culler,et al.  The ganglia distributed monitoring system: design, implementation, and experience , 2004, Parallel Comput..