Interpreting Write Performance of Supercomputer I/O Systems with Regression Models

This work seeks to advance the state of the art in HPC I/O performance analysis and interpretation. In particular, we demonstrate effective techniques to (1) model output performance in the presence of I/O interference from production loads; (2) build features from write patterns and key parameters of the system architecture and configuration; and (3) employ suitable machine learning algorithms to improve model accuracy. We train models with five popular regression algorithms and conduct experiments on two distinct production HPC platforms. We find that the lasso and random forest models predict output performance with high accuracy on both target systems. We also explore the use of these models to guide adaptation in I/O middleware systems and show potential improvements of at least 15% from model-guided adaptation on 70% of samples, and of up to $10\times$ on some samples, on both target systems.
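As a rough illustration of what model-guided adaptation could look like, the sketch below picks, from a set of candidate write configurations, the one with the lowest predicted write time and reports the predicted relative improvement over a default. It is a minimal sketch under stated assumptions, not the method used in this work: the `aggregators` parameter, the `toy_predict_write_time` cost function, and the `choose_config` helper are all hypothetical, and the toy function merely stands in for a trained regression model.

```python
from typing import Callable, Dict, Iterable, Tuple

# A write configuration is a mapping of tunable names to values.
# The single "aggregators" knob used here is a hypothetical example.
Config = Dict[str, int]


def toy_predict_write_time(config: Config) -> float:
    """Hypothetical stand-in for a trained regression model (seconds).

    Illustrative only: transfer time shrinks as aggregators are added,
    while a per-aggregator coordination overhead grows.
    """
    data_mib = 4096
    aggregators = config["aggregators"]
    return data_mib / (aggregators * 100.0) + 0.5 * aggregators


def choose_config(
    predict: Callable[[Config], float],
    default: Config,
    candidates: Iterable[Config],
) -> Tuple[Config, float]:
    """Return the candidate with the lowest predicted write time and the
    predicted relative improvement over the default configuration.

    Falls back to the default (improvement 0.0) when no candidate is
    predicted to be faster.
    """
    baseline = predict(default)
    best = min(candidates, key=predict)
    best_time = predict(best)
    if best_time >= baseline:
        return default, 0.0
    return best, (baseline - best_time) / baseline


default_config = {"aggregators": 4}
candidate_configs = [{"aggregators": n} for n in (1, 2, 4, 8, 16)]
chosen, gain = choose_config(
    toy_predict_write_time, default_config, candidate_configs
)
print(f"chosen={chosen}, predicted improvement={gain:.1%}")
```

In practice the predictor would be a model trained on measured write performance; the selection step stays the same regardless of which regression algorithm produces the predictions.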
