Towards HPC I/O Performance Prediction through Large-scale Log Analysis

Large-scale high performance computing (HPC) systems typically consist of many thousands of CPUs and storage units and are used by hundreds to thousands of users at the same time. Applications from these users have diverse characteristics, such as varying compute, communication, memory, and I/O intensity. A good understanding of the performance characteristics of each user application is important for job scheduling and resource provisioning. Among these characteristics, I/O performance is difficult to predict because the I/O system software is complex, the I/O system is shared among all users, and I/O operations rely heavily on the networking system. To improve the prediction of I/O performance on HPC systems, we propose to integrate information from a number of different system logs and develop a regression-based approach that dynamically selects the most relevant features from the most recent log entries and automatically selects the best regression algorithm for the prediction task. Evaluation results using logs from the CORI supercomputer at NERSC show that the proposed scheme can predict I/O performance with up to 84% prediction accuracy for I/O-intensive applications.
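The two selection steps named in the abstract (picking relevant features from recent log entries, then picking a regression algorithm) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the correlation-based feature ranking, the three candidate regressors, the hold-out split, the window size, and all function names are hypothetical, and the data is synthetic.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)

def select_features(rows, targets, k):
    """Assumed ranking rule: keep the k columns with the largest |r| to the target."""
    scores = [abs(pearson([r[j] for r in rows], targets)) for j in range(len(rows[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

def fit_mean(X, y):
    """Baseline: always predict the training mean."""
    mean = sum(y) / len(y)
    return lambda x: mean

def fit_linear(X, y):
    """Least-squares line on the highest-ranked feature only."""
    xs = [x[0] for x in X]
    mx, my = sum(xs) / len(xs), sum(y) / len(y)
    sxx = sum((v - mx) ** 2 for v in xs)
    slope = sum((v - mx) * (t - my) for v, t in zip(xs, y)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x[0] - mx)

def fit_knn(X, y, k=3):
    """Unweighted k-nearest-neighbour regression with Euclidean distance."""
    def predict(x):
        nearest = sorted(range(len(X)), key=lambda i: math.dist(X[i], x))[:k]
        return sum(y[i] for i in nearest) / len(nearest)
    return predict

# Hypothetical candidate pool; the paper's actual candidates are not listed in the abstract.
CANDIDATES = {"mean": fit_mean, "linear": fit_linear, "knn": fit_knn}

def mae(model, X, y):
    """Mean absolute error of a fitted model on (X, y)."""
    return sum(abs(model(x) - t) for x, t in zip(X, y)) / len(y)

def build_predictor(rows, targets, window=200, k=2, holdout=0.25):
    """Rows are ordered oldest to newest; only the last `window` entries are used."""
    rows, targets = rows[-window:], targets[-window:]
    keep = select_features(rows, targets, k)
    X = [[r[j] for j in keep] for r in rows]
    split = int(len(X) * (1 - holdout))
    # Validate each candidate on the newest part of the window.
    errors = {name: mae(fit(X[:split], targets[:split]), X[split:], targets[split:])
              for name, fit in CANDIDATES.items()}
    best = min(errors, key=errors.get)
    model = CANDIDATES[best](X, targets)  # refit the winner on the whole window
    return keep, best, lambda row: model([row[j] for j in keep])

# Synthetic stand-in for merged log records: 4 features, target driven by column 1.
rng = random.Random(0)
rows = [[rng.random() for _ in range(4)] for _ in range(300)]
targets = [5.0 * r[1] + rng.gauss(0, 0.1) for r in rows]
keep, best, predict = build_predictor(rows, targets)
print(keep, best, predict([0.5, 0.5, 0.5, 0.5]))
```

Rerunning `build_predictor` as new log entries arrive is what would make the selection dynamic in this sketch: both the kept feature indices and the winning regressor may change from one window to the next.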
