Modeling I/O Performance Variability Using Conditional Variational Autoencoders

Storage system performance modeling is crucial for the efficient use of heterogeneous shared resources on leadership-class computers. Variability in application performance, particularly variability arising from concurrent applications sharing I/O resources, is a major hurdle in developing accurate performance models. We adopt a deep learning approach based on conditional variational autoencoders (CVAEs) for I/O performance modeling and use it to quantify performance variability. We illustrate our approach using data collected on Edison, a production supercomputing system at the National Energy Research Scientific Computing Center (NERSC). We evaluate the CVAE approach by comparing it with a previously proposed sensitivity-based Gaussian process (GP) model. We find that the CVAE model performs slightly better than the GP model when training and testing data come from different applications, because the CVAE can inherently leverage the full data set from multiple applications, whereas the GP approach partitions the data and builds a separate model for each partition. Hence, the CVAE offers an alternative modeling approach that requires no pre-processing and is flexible enough to handle data from a wide variety of applications without changing the inference approach.
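
The abstract does not state the training objective, so the following is only a minimal sketch of the standard CVAE loss (the negative evidence lower bound), assuming diagonal Gaussian distributions for the recognition network q(z | x, y), the conditional prior p(z | x), and the decoder p(y | x, z). Here x would stand for application and system features and y for the observed I/O performance. The function names `gaussian_kl` and `negative_elbo` are hypothetical and are not taken from the paper; the means and log-variances are assumed to be produced by neural networks that are not shown.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over latent dimensions."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0,
        axis=-1,
    )

def negative_elbo(y, y_mean, y_logvar, mu_q, logvar_q, mu_p, logvar_p):
    """Negative ELBO: Gaussian negative log-likelihood of y plus the KL term.

    y_mean, y_logvar  : decoder output for p(y | x, z)      (assumed Gaussian)
    mu_q, logvar_q    : recognition network q(z | x, y)     (assumed Gaussian)
    mu_p, logvar_p    : conditional prior p(z | x)          (assumed Gaussian)
    """
    nll = 0.5 * np.sum(
        np.log(2.0 * np.pi) + y_logvar + (y - y_mean) ** 2 / np.exp(y_logvar),
        axis=-1,
    )
    return nll + gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
```

Under these assumptions, performance variability for a given x could be estimated after training by drawing several z from the conditional prior p(z | x) and examining the spread of the decoded y values.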
