NWPerf: a system wide performance monitoring tool for large Linux clusters

We present NWPerf, a new system for analyzing fine-granularity performance metric data on large-scale supercomputing clusters. The tool measures application efficiency system-wide, offering both a global view of the whole system and a detailed view of individual applications. NWPerf provides this service while minimizing the impact on the performance of user applications. We describe the type of information that can be derived from the system and demonstrate how it was used to detect and eliminate a performance problem in an application, improving that application's performance by up to several thousand percent. The NWPerf architecture has proven to be a stable and scalable platform for gathering performance data on a large 1954-CPU production Linux cluster at PNNL.
