Reflector: a fine-grained I/O tracker for HPC systems

We present Reflector, a fine-grained I/O tracker that supports both high-level and low-level I/O monitoring through user-defined interfaces such as HDF5 and NetCDF, in addition to POSIX I/O and MPI-IO. We evaluate Reflector on both an on-premises 500-core HPC cluster and a leadership-class supercomputer at the Lawrence Berkeley National Laboratory. Preliminary results are promising: the prototype incurs negligible performance overhead and clearly illustrates the I/O patterns and bottlenecks of multiple applications.
