A Technique for Finding Optimal Program Launch Parameters Targeting Manycore Accelerators

We present a technique for dynamically determining the values of program parameters that optimize the performance of a multithreaded program P. More precisely, we describe how to statically build another program, R, that dynamically determines the values of the program parameters yielding the best performance of P, given values for its data and hardware parameters. Although the technique applies to parallel programs in general, we are particularly interested in programs targeting manycore accelerators. We have successfully applied it to GPU kernels, using the MWP-CWP performance model for CUDA.
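To make the split between build time and run time concrete, the following minimal Python sketch builds a selector, playing the role of R, once from a cost model and then calls it at run time with concrete data and hardware parameters. It is an illustration only, not the construction used in the paper. The names `make_selector` and `toy_predicted_time`, the parameter keys, and the cost formula are all hypothetical, and the toy cost model is a stand-in, not the MWP-CWP model.

```python
from typing import Callable, Dict, Sequence

Params = Dict[str, float]
CostModel = Callable[[Params, Params, int], float]


def make_selector(predict_time: CostModel,
                  candidates: Sequence[int]) -> Callable[[Params, Params], int]:
    """Build, ahead of time, a selector (the role of R) for one program P.

    `predict_time` is an assumed cost model for P; `candidates` are the
    thread-block sizes the selector may choose from.
    """
    def select(data: Params, hw: Params) -> int:
        # Keep only block sizes the (hypothetical) hardware limit allows.
        feasible = [b for b in candidates if b <= hw["max_threads_per_block"]]
        if not feasible:
            raise ValueError("no candidate block size fits the hardware limit")
        # Pick the candidate with the lowest predicted running time.
        return min(feasible, key=lambda b: predict_time(data, hw, b))
    return select


def toy_predicted_time(data: Params, hw: Params, block_size: int) -> float:
    """Hypothetical stand-in cost model (NOT MWP-CWP)."""
    n = int(data["n"])
    blocks = -(-n // block_size)                 # ceil(n / block_size)
    waves = -(-blocks // int(hw["num_sms"]))     # ceil(blocks / num_sms)
    return waves * (hw["launch_overhead"] + block_size * hw["cost_per_thread"])


# Built once, before the data and hardware parameters are known.
select_block_size = make_selector(toy_predicted_time, [32, 64, 128, 256, 512])

# Called at run time, once those values are available.
hw = {"num_sms": 4, "max_threads_per_block": 256,
      "launch_overhead": 10.0, "cost_per_thread": 1.0}
best = select_block_size({"n": 1024}, hw)
```

The selector here enumerates a fixed candidate list; any other search over the feasible launch parameters could be substituted without changing the overall shape.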
