Adaptive GPU Array Layout Auto-Tuning

Optimal performance is an important goal in compute-intensive applications. Achieving it on GPUs requires deep experience with both the algorithms and the underlying hardware, which makes GPU applications an ideal target for auto-tuning. We present an auto-tuner that optimizes array layouts in CUDA applications. Because a kernel's optimal configuration can vary with the data and program parameters, we adjust array layouts adaptively at runtime and match or even exceed the performance of hand-optimized code. We automatically detect data characteristics to identify distinct performance scenarios, without user input or additional programming, and we construct our decision models from an empirical analysis of the application. In principle, this adaptive optimization requires profiling data for an extremely large number of scenarios, which cannot be evaluated exhaustively for complex applications. We solve this by extending a previously published method for efficiently profiling single kernel calls so that it finds application-wide optimal solutions. Our method optimizes applications in a few minutes, reaching speedups of up to 20% over hand-optimized code.
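To illustrate the kind of layout choice the tuner decides, the sketch below contrasts an array-of-structs (AoS) layout with a struct-of-arrays (SoA) layout for the same element-wise update. This is a minimal, hypothetical example (the struct, kernel, and variable names are ours, not the paper's): on many GPUs the SoA variant coalesces global-memory accesses better, but as the abstract notes, the faster layout depends on the data and program parameters, so a runtime decision model may dispatch either variant.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// AoS: fields of one element are interleaved in memory.
struct ParticleAoS { float x, y, z, w; };

__global__ void scaleAoS(ParticleAoS* p, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Adjacent threads access addresses sizeof(ParticleAoS) apart,
    // so accesses are strided rather than coalesced.
    if (i < n) p[i].x *= s;
}

// SoA: each field lives in its own contiguous array.
__global__ void scaleSoA(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Adjacent threads touch adjacent floats: fully coalesced.
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    ParticleAoS* aos = nullptr;
    float* x = nullptr;
    cudaMalloc(&aos, n * sizeof(ParticleAoS));
    cudaMalloc(&x,   n * sizeof(float));

    // A runtime tuner, as described in the abstract, would profile (or
    // predict from a learned decision model) both variants for the current
    // data characteristics and dispatch the faster one. The flag below is
    // a stand-in for that decision.
    bool preferSoA = true;
    int blocks = (n + 255) / 256;
    if (preferSoA) scaleSoA<<<blocks, 256>>>(x, 2.0f, n);
    else           scaleAoS<<<blocks, 256>>>(aos, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(aos);
    cudaFree(x);
    return 0;
}
```

In a real adaptive setting, the decision would not be a hard-coded flag: per the abstract, profiling data gathered from single kernel calls feeds a decision model that selects the layout at runtime for each detected scenario.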
