Optimizing I/O Performance of HPC Applications with Autotuning

Parallel input/output (I/O) is an essential component of modern high-performance computing (HPC). Obtaining good I/O performance for a broad range of applications on diverse HPC platforms is a major challenge, in part because of complex interdependencies between I/O middleware and hardware. The parallel file system and I/O middleware layers all offer optimization parameters that can, in theory, result in better I/O performance. Unfortunately, the right combination of parameters is highly dependent on the application, HPC platform, problem size, and concurrency. Scientific application developers do not have the time or expertise to take on the substantial burden of identifying good parameters for each problem configuration. They resort to using system defaults, a choice that frequently results in poor I/O performance. We expect this problem to be compounded on exascale-class machines, which will likely have a deeper software stack with hierarchically arranged hardware resources. As a solution to this problem, we present an autotuning system for optimizing I/O performance, encompassing I/O performance modeling, I/O tuning, and I/O patterns. We demonstrate the value of this framework across several HPC platforms and applications at scale.
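To make the idea of searching for a good parameter combination concrete, the following is a minimal sketch and not the system described above. It assumes a small, hypothetical search space (the parameter names `stripe_count`, `stripe_size_mb`, and `cb_nodes` and all their values are illustrative) and replaces real I/O measurements with a synthetic cost function, `mock_write_time`, invented purely for this example. It uses plain exhaustive search, which is only feasible because the space here is tiny.

```python
import itertools

# Hypothetical search space: names and candidate values are illustrative only.
SEARCH_SPACE = {
    "stripe_count": [4, 16, 64],
    "stripe_size_mb": [1, 8, 32],
    "cb_nodes": [8, 32, 128],
}

# Hypothetical "system defaults" used as the baseline for comparison.
DEFAULTS = {"stripe_count": 4, "stripe_size_mb": 1, "cb_nodes": 8}


def mock_write_time(cfg):
    """Synthetic stand-in for timing a real I/O run, in seconds.

    This is NOT a performance model; it simply penalizes distance from an
    arbitrary sweet spot so that the search has something to minimize.
    """
    return (
        10.0
        + abs(cfg["stripe_count"] - 64) * 0.5
        + abs(cfg["stripe_size_mb"] - 8) * 0.25
        + abs(cfg["cb_nodes"] - 32) * 0.1
    )


def autotune(space, measure):
    """Evaluate every combination in `space` and return the fastest one."""
    names = list(space)
    best_cfg, best_time = None, float("inf")
    for values in itertools.product(*(space[n] for n in names)):
        cfg = dict(zip(names, values))
        elapsed = measure(cfg)
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg, best_time


best_cfg, best_time = autotune(SEARCH_SPACE, mock_write_time)
default_time = mock_write_time(DEFAULTS)
print("best configuration:", best_cfg)
print("best time (s):", best_time, "| default time (s):", default_time)
```

In a real setting, `measure` would run the application or an I/O kernel with the given settings, and the number of combinations grows multiplicatively with each added parameter, which is why a heuristic search or a performance model is typically preferred over exhaustive enumeration.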
