Performance Prediction for Large-Scale Parallel Applications Using Representative Replay

Automatically predicting the performance of parallel applications has been a long-standing goal in high performance computing. Accurate prediction is challenging, however, because the execution time of a parallel application is determined by several factors, such as sequential computation time, communication time, and their complex interactions. Despite previous efforts, accurately estimating the sequential computation time of each process in a large-scale parallel application remains an open problem. In this paper, we propose a novel approach to acquiring accurate sequential computation time using a parallel debugging technique called deterministic replay. The main advantage of our approach is that it requires only a single node of the target platform; the whole target platform does not need to be available. With this approach, we can therefore measure the real sequential computation time of each process on a target node, one process at a time. Moreover, we observe that there is great computation similarity in parallel applications, not only within each process but also among different processes. Based on this observation, we further propose representative replay, which significantly reduces replay overhead because only a subset of the iterations of a few representative processes needs to be replayed instead of every iteration of every process. Finally, we implement a complete performance prediction system, called Phantom, which combines this computation-time acquisition approach with a trace-driven simulator. We validate our approach on both traditional HPC platforms and the latest Amazon EC2 cloud platform. On both types of platforms, the prediction error of our approach is less than 7 percent on average for runs of up to 2,500 processes.
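The idea behind representative replay can be illustrated with a minimal sketch. It is not Phantom's implementation: the per-process "signature" vectors, the relative tolerance `tol`, the greedy grouping rule, and the `replay` callback (standing in for a deterministic replay of one process on a single target node) are all assumptions introduced here for illustration.

```python
from typing import Callable, Dict, List, Sequence


def group_similar(signatures: Sequence[Sequence[float]],
                  tol: float = 0.05) -> List[List[int]]:
    """Greedily group process ranks whose signatures are element-wise
    within a relative tolerance `tol` of the first member of a group.

    A signature is a hypothetical per-process summary of computation
    behavior (for example, work counts per computation phase).
    """
    groups: List[List[int]] = []
    for rank, sig in enumerate(signatures):
        for group in groups:
            ref = signatures[group[0]]
            if len(ref) == len(sig) and all(
                abs(a - b) <= tol * max(abs(a), abs(b), 1e-12)
                for a, b in zip(ref, sig)
            ):
                group.append(rank)
                break
        else:
            # No existing group is similar enough: start a new one.
            groups.append([rank])
    return groups


def predict_compute_times(signatures: Sequence[Sequence[float]],
                          replay: Callable[[int], float],
                          tol: float = 0.05) -> Dict[int, float]:
    """Replay only one representative per group and reuse its measured
    computation time for every other process in that group."""
    times: Dict[int, float] = {}
    for group in group_similar(signatures, tol):
        measured = replay(group[0])  # hypothetical single-node replay
        for rank in group:
            times[rank] = measured
    return times
```

With four processes of which three behave alike, `replay` would be invoked twice rather than four times; the saving grows with the number of processes that share a group. How similarity is actually detected, and how many iterations of each representative are replayed, are details of the paper itself and are not modeled here.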
