Using Small-Scale History Data to Predict Large-Scale Performance of HPC Application

Performance modeling is an important problem in high-performance computing (HPC). Machine learning (ML) is a powerful approach to HPC performance modeling because it can learn complex relations between application parameters and application performance from historical execution data. However, extrapolating large-scale performance from only small-scale execution data is difficult for ML, because the independent and identically distributed (i.i.d.) hypothesis, the basic assumption of most ML algorithms, does not hold in this situation. To solve the extrapolation problem, we propose a two-level model consisting of an interpolation level and an extrapolation level. The interpolation level predicts small-scale performance from small-scale execution data. The extrapolation level predicts the large-scale performance for a fixed input parameter setting from its small-scale performance predictions. In the interpolation level, we use random forests to build the interpolation models that predict small-scale performance. In the extrapolation level, to reduce the negative influence of interpolation errors, we employ multitask lasso with clustering to construct the scalability models that predict large-scale performance. To validate the utility of our two-level model, we conduct experiments on a real HPC platform and build models for two HPC applications. Compared with existing ML methods, our method achieves higher prediction accuracy.
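
The extrapolation difficulty can be illustrated with a minimal sketch. It is not the two-level model itself: plain ordinary least squares stands in for multitask lasso with clustering, a nearest-neighbor lookup stands in for a purely data-driven interpolator, and the curve form t(p) = c0 + c1/p + c2*log2(p), the process counts, and the synthetic runtimes are all assumptions made for illustration.

```python
import math

def features(p):
    # Basis terms of a hypothetical scalability curve t(p) = c0 + c1/p + c2*log2(p).
    return [1.0, 1.0 / p, math.log2(p)]

def solve(a, b):
    # Solve the square system a x = b by Gaussian elimination with partial pivoting.
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_scalability(procs, times):
    # Ordinary least squares through the normal equations (X^T X) c = X^T y.
    # Assumption: OLS replaces the multitask lasso named in the abstract.
    rows = [features(p) for p in procs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, times)) for i in range(k)]
    return solve(xtx, xty)

def predict(coef, p):
    # Evaluate the fitted curve at process count p.
    return sum(c * f for c, f in zip(coef, features(p)))

def nearest_neighbor(procs, times, p):
    # Data-driven stand-in: return the runtime observed at the closest seen scale.
    i = min(range(len(procs)), key=lambda j: abs(procs[j] - p))
    return times[i]

def true_time(p):
    # Synthetic ground truth (hypothetical coefficients, not measured data).
    return 2.0 + 100.0 / p + 0.5 * math.log2(p)

# Small-scale "history" for one fixed input parameter setting.
small_procs = [2, 4, 8, 16, 32, 64]
small_times = [true_time(p) for p in small_procs]

large_p = 1024
coef = fit_scalability(small_procs, small_times)
extrapolated = predict(coef, large_p)
clamped = nearest_neighbor(small_procs, small_times, large_p)

print("true runtime at p=1024:     ", true_time(large_p))
print("scalability-curve estimate: ", extrapolated)
print("nearest-neighbor estimate:  ", clamped)
```

In this sketch the nearest-neighbor estimate can only return the runtime observed at the largest small-scale run, because the target scale lies outside the training distribution. The fitted curve recovers the large-scale value here only because the synthetic data is noise-free and matches the assumed curve form exactly.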
