Taming parallel I/O complexity with auto-tuning

We present an auto-tuning system for optimizing the I/O performance of HDF5 applications and demonstrate its value across platforms and applications, and at scale. The system uses a genetic algorithm to search a large space of tunable parameters and to identify effective settings at all layers of the parallel I/O stack. The parameter settings are applied transparently by the auto-tuning system via dynamically intercepted HDF5 calls. To validate the system, we applied it to three I/O benchmarks (VPIC, VORPAL, and GCRM) that replicate the I/O activity of their respective applications. We tested the system with different weak-scaling configurations (128, 2048, and 4096 CPU cores) that generate 30 GB to 1 TB of data, and executed these configurations on diverse HPC platforms (Cray XE6, IBM BG/P, and Dell Cluster). In all cases, the auto-tuning framework identified tunable parameters that substantially improved write performance over default system settings. We consistently observed I/O write speedups between 2× and 100× across our test configurations.
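The genetic-algorithm search described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the parameter names and value ranges (Lustre stripe settings, MPI-IO collective-buffering nodes, HDF5 alignment) are representative examples of I/O-stack tunables, and the fitness function is a synthetic stand-in for what would, in the real system, be a measured write bandwidth from running the benchmark with the candidate settings.

```python
import random

# Hypothetical tunable-parameter space spanning three I/O stack layers.
# The specific values are illustrative, not the paper's actual search space.
SPACE = {
    "stripe_count":   [4, 8, 16, 32, 64],     # parallel file system layer
    "stripe_size_mb": [1, 4, 16, 64, 128],    # parallel file system layer
    "cb_nodes":       [1, 2, 4, 8, 16],       # MPI-IO layer (collective buffering)
    "alignment_kb":   [64, 256, 1024, 4096],  # HDF5 layer
}

def random_config():
    """One individual: a random choice per tunable parameter."""
    return {k: random.choice(v) for k, v in SPACE.items()}

def mutate(cfg, rate=0.2):
    """Re-randomize each gene independently with probability `rate`."""
    child = dict(cfg)
    for k, vals in SPACE.items():
        if random.random() < rate:
            child[k] = random.choice(vals)
    return child

def crossover(a, b):
    """Uniform crossover: each gene comes from one of the two parents."""
    return {k: random.choice((a[k], b[k])) for k in SPACE}

def evolve(fitness, pop_size=8, generations=10, seed=0):
    """Truncation-selection GA: keep the fittest half, breed the rest."""
    random.seed(seed)
    pop = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = [
            mutate(crossover(random.choice(parents), random.choice(parents)))
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return max(pop, key=fitness)

# Synthetic fitness surrogate: the real system would execute the I/O
# benchmark with these settings and return the measured write bandwidth.
def surrogate_bandwidth(cfg):
    return cfg["stripe_count"] * cfg["stripe_size_mb"] - abs(cfg["cb_nodes"] - 8)

best = evolve(surrogate_bandwidth)
```

In the actual framework, evaluating `fitness` is the expensive step (a full benchmark run per candidate), which is why a heuristic search over the combinatorial space is preferable to exhaustive sweeps; the winning settings are then injected transparently by intercepting the application's HDF5 calls.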
