NetLogger: A Toolkit for Distributed System Performance Tuning and Debugging

Developers and users of high-performance distributed systems often observe performance problems such as unexpectedly low throughput or high latency. Determining the source of these problems requires detailed end-to-end instrumentation of all components, including the applications, operating systems, hosts, and networks. In this paper we describe a methodology that enables real-time diagnosis of performance problems in complex high-performance distributed systems. The methodology includes tools for generating timestamped event logs, which provide detailed end-to-end application-level and system-level monitoring, and tools for visualizing the log data and the real-time state of the distributed system. This methodology, called NetLogger, has proven invaluable for diagnosing problems in networks and in distributed systems code. The approach is novel in that it combines network, host, and application-level monitoring, providing a complete view of the entire system. NetLogger is designed to be extremely lightweight, and it includes a mechanism for reliably collecting monitoring events from multiple distributed locations.
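
To make the idea of timestamped event logs concrete, here is a minimal Python sketch of how an instrumented component might emit an event and how the time between two events could be measured afterwards. It is not NetLogger's actual API or log format: the function names (`format_event`, `parse_event`, `elapsed_seconds`), the `key=value` line layout, and the field names (`ts`, `event`, `host`, `prog`) are all assumptions made for illustration. The sketch also assumes that field values contain no spaces and that the clocks of the hosts involved are synchronized.

```python
import time
from datetime import datetime, timezone

# Hypothetical timestamp layout for this sketch: UTC with microseconds.
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def format_event(name, host, prog, ts=None, **fields):
    """Return one log line of key=value pairs, led by a UTC timestamp.

    `ts` is seconds since the epoch; it defaults to the current time.
    Extra keyword arguments become additional fields, sorted by key.
    """
    ts = time.time() if ts is None else ts
    stamp = datetime.fromtimestamp(ts, tz=timezone.utc).strftime(TS_FORMAT)
    pairs = [("ts", stamp), ("event", name), ("host", host), ("prog", prog)]
    pairs += sorted(fields.items())
    return " ".join(f"{key}={value}" for key, value in pairs)


def parse_event(line):
    """Split a log line back into a dict (assumes values have no spaces)."""
    return dict(item.split("=", 1) for item in line.split())


def elapsed_seconds(start_line, end_line):
    """Time between two logged events, assuming synchronized host clocks."""
    t0 = datetime.strptime(parse_event(start_line)["ts"], TS_FORMAT)
    t1 = datetime.strptime(parse_event(end_line)["ts"], TS_FORMAT)
    return (t1 - t0).total_seconds()


if __name__ == "__main__":
    # Two events for the same request, logged by different components.
    sent = format_event("request.sent", "hostA", "client", ts=1000.0, size=4096)
    done = format_event("request.done", "hostB", "server", ts=1000.25, size=4096)
    print(sent)
    print(done)
    print(elapsed_seconds(sent, done))
```

Pairing events from different components in this way is what lets a high end-to-end latency be attributed to a specific stage, such as the application, the host, or the network, rather than to the system as a whole.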
