Tuning parallel applications in parallel

In this paper, we present and evaluate a parallel algorithm for parameter tuning of parallel applications. We discuss the impact of performance variability on the accuracy and efficiency of the optimization algorithm and propose a strategy to minimize the impact of this variability. We evaluate our algorithm within the Active Harmony system, an automated online/offline tuning framework. We study its performance on three benchmark codes: PSTSWM, HPL, and POP. Compared to the Nelder-Mead algorithm, our algorithm finds better configurations up to seven times faster. For POP, we improved the performance of a production-sized run by 59%.
