Parallel File System Analysis Through Application I/O Tracing

Input/Output (I/O) operations can represent a significant proportion of the run-time of parallel scientific computing applications. Although there have been several advances in file format libraries, file system design and I/O hardware, a growing divergence exists between the performance of parallel file systems and the compute clusters that they support. In this paper, we document the design and application of the RIOT I/O toolkit (RIOT) being developed at the University of Warwick with our industrial partners at the Atomic Weapons Establishment and Sandia National Laboratories. We use the toolkit to assess the performance of three industry-standard I/O benchmarks on three contrasting supercomputers, ranging from a mid-sized commodity cluster to a large-scale proprietary IBM BlueGene/P system. RIOT provides a powerful framework in which to analyse I/O and parallel file system behaviour—we demonstrate, for example, the large file locking overhead of IBM's General Parallel File System, which can consume nearly 30% of the total write time in the FLASH-IO benchmark. Through I/O trace analysis, we also assess the performance of HDF-5 in its default configuration, identifying a bottleneck created by the use of suboptimal Message Passing Interface hints. Furthermore, we investigate the performance gains attributed to the Parallel Log-structured File System (PLFS) being developed by EMC Corporation and the Los Alamos National Laboratory. Our evaluation of PLFS involves two high-performance computing systems with contrasting I/O backplanes and illustrates the varied improvements to I/O that result from the deployment of PLFS (ranging from up to 25× speed-up in I/O performance on a large I/O installation to 2× speed-up on the much smaller installation at the University of Warwick).
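To make the hint-tuning issue concrete, the following minimal sketch (in C, using the MPI-IO interface) shows how file-system hints are supplied through an MPI_Info object when a shared file is opened collectively. The hint keys shown are standard ROMIO collective-buffering controls; the particular keys and values are illustrative assumptions only and are not the specific settings identified as suboptimal in this study.

    /* Minimal sketch: passing MPI-IO hints via an MPI_Info object.
     * The hint keys are standard ROMIO collective-buffering controls;
     * the values are example assumptions, not settings from this paper. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Info info;

        MPI_Init(&argc, &argv);

        MPI_Info_create(&info);
        /* Enable collective buffering for writes and size the
         * aggregation buffer -- illustrative values only. */
        MPI_Info_set(info, "romio_cb_write", "enable");
        MPI_Info_set(info, "cb_buffer_size", "16777216");  /* 16 MiB */

        /* Open a shared file with the hints attached; "checkpoint.dat"
         * is a hypothetical file name used for illustration. */
        MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* ... collective writes (e.g. MPI_File_write_at_all) would go here ... */

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }

Libraries such as HDF-5 accept an MPI_Info object of this kind at file-creation time, which is why the defaults chosen by the library (rather than by the application) can become the bottleneck that trace analysis exposes.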
