Taming parallel I/O complexity with auto-tuning

We present an auto-tuning system for optimizing the I/O performance of HDF5 applications and demonstrate its value across platforms and applications, and at scale. The system uses a genetic algorithm to search a large space of tunable parameters and to identify effective settings at all layers of the parallel I/O stack. The parameter settings are applied transparently by the auto-tuning system via dynamically intercepted HDF5 calls. To validate the system, we applied it to three I/O benchmarks (VPIC, VORPAL, and GCRM) that replicate the I/O activity of their respective applications. We tested the system with different weak-scaling configurations (128, 2048, and 4096 CPU cores) that generate 30 GB to 1 TB of data, and executed these configurations on diverse HPC platforms (Cray XE6, IBM BG/P, and Dell Cluster). In all cases, the auto-tuning framework identified tunable parameters that substantially improved write performance over default system settings. We consistently observed I/O write speedups between 2× and 100× across our test configurations.
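The genetic-algorithm search described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the parameter names and value ranges (Lustre stripe settings, MPI-IO collective-buffering nodes, HDF5 alignment) are representative examples of I/O-stack tunables, and the fitness function is a synthetic stand-in for what would, in the real system, be a measured write bandwidth from running the benchmark with the candidate settings.

```python
import random

# Hypothetical tunable-parameter space spanning three I/O stack layers.
# The specific values are illustrative, not the paper's actual search space.
SPACE = {
    "stripe_count":   [4, 8, 16, 32, 64],     # parallel file system layer
    "stripe_size_mb": [1, 4, 16, 64, 128],    # parallel file system layer
    "cb_nodes":       [1, 2, 4, 8, 16],       # MPI-IO layer (collective buffering)
    "alignment_kb":   [64, 256, 1024, 4096],  # HDF5 layer
}

def random_config():
    """One individual: a random choice per tunable parameter."""
    return {k: random.choice(v) for k, v in SPACE.items()}

def mutate(cfg, rate=0.2):
    """Re-randomize each gene independently with probability `rate`."""
    child = dict(cfg)
    for k, vals in SPACE.items():
        if random.random() < rate:
            child[k] = random.choice(vals)
    return child

def crossover(a, b):
    """Uniform crossover: each gene comes from one of the two parents."""
    return {k: random.choice((a[k], b[k])) for k in SPACE}

def evolve(fitness, pop_size=8, generations=10, seed=0):
    """Truncation-selection GA: keep the fittest half, breed the rest."""
    random.seed(seed)
    pop = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = [
            mutate(crossover(random.choice(parents), random.choice(parents)))
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return max(pop, key=fitness)

# Synthetic fitness surrogate: the real system would execute the I/O
# benchmark with these settings and return the measured write bandwidth.
def surrogate_bandwidth(cfg):
    return cfg["stripe_count"] * cfg["stripe_size_mb"] - abs(cfg["cb_nodes"] - 8)

best = evolve(surrogate_bandwidth)
```

In the actual framework, evaluating `fitness` is the expensive step (a full benchmark run per candidate), which is why a heuristic search over the combinatorial space is preferable to exhaustive sweeps; the winning settings are then injected transparently by intercepting the application's HDF5 calls.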
