Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance

GPUs have become prevalent and more general purpose, but GPU programming remains challenging and time consuming for the majority of programmers. In addition, it is not always clear which codes will benefit from getting ported to GPU. Therefore, having a tool to estimate GPU performance for a piece of code before writing a GPU implementation is highly desirable. To this end, we propose Cross-Architecture Performance Prediction (XAPP), a machine-learning based technique that uses only single-threaded CPU implementation to predict GPU performance. Our paper is built on the two following insights: i) Execution time on GPU is a function of program properties and hardware characteristics. ii) By examining a vast array of previously implemented GPU codes along with their CPU counterparts, we can use established machine learning techniques to learn this correlation between program properties, hardware characteristics and GPU execution time. We use an adaptive two-level machine learning solution. Our results show that our tool is robust and accurate: we achieve 26.9% average error on a set of 24 real-world kernels. We also discuss practical usage scenarios for XAPP.

[1]  M. Pazzani,et al.  Error Reduction through Learning Multiple Descriptions , 1996, Machine Learning.

[2]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[3]  David H. Bailey,et al.  The Nas Parallel Benchmarks , 1991, Int. J. High Perform. Comput. Appl..

[4]  C. Pipper,et al.  [''R"--project for statistical computing]. , 2008, Ugeskrift for laeger.

[5]  Keshav Pingali,et al.  Lonestar: A suite of parallel irregular programs , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[6]  Pedro M. Domingos Knowledge Discovery Via Multiple Models , 1998, Intell. Data Anal..

[7]  Venkatram Vishwanath,et al.  GROPHECY: GPU performance projection from CPU code skeletons , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[8]  Wen-mei W. Hwu,et al.  Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .

[9]  Lieven Eeckhout,et al.  Ranking commercial machines through data transposition , 2011, 2011 IEEE International Symposium on Workload Characterization (IISWC).

[10]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[11]  Derek Chiou,et al.  GPGPU performance and power estimation using machine learning , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[12]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[13]  Margaret Martonosi,et al.  Stargazer: Automated regression-based GPU design space exploration , 2012, 2012 IEEE International Symposium on Performance Analysis of Systems & Software.

[14]  Erik R. Altman,et al.  Predicting GPU Performance from CPU Runs Using Machine Learning , 2014, 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing.

[15]  Eric Bauer,et al.  An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants , 1999, Machine Learning.

[16]  Sally A. McKee,et al.  Efficiently exploring architectural design spaces via predictive modeling , 2006, ASPLOS XII.

[17]  L. John,et al.  Modeling program resource demand using inherent program characteristics , 2011, PERV.

[18]  David M. Brooks,et al.  Accurate and efficient regression modeling for microarchitectural performance and power prediction , 2006, ASPLOS XII.

[19]  Matthew D. Sinclair,et al.  Porting CMP Benchmarks to GPUs , 2011 .

[20]  Henk Corporaal,et al.  The boat hull model: adapting the roofline model to enable performance prediction for parallel computing , 2012, PPoPP '12.

[21]  David M. Brooks,et al.  Illustrative Design Space Studies with Microarchitectural Regression Models , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[22]  Hyesoon Kim,et al.  An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness , 2009, ISCA '09.

[23]  Berkin Özisikyilmaz,et al.  Efficient system design space exploration using machine learning techniques , 2008, 2008 45th ACM/IEEE Design Automation Conference.

[24]  Thomas G. Dietterich Multiple Classifier Systems , 2000, Lecture Notes in Computer Science.

[25]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[26]  King-Sun Fu,et al.  Sequential Methods in Pattern Recognition and Machine Learning , 2012 .

[27]  Xingfu Wu,et al.  Performance projection of HPC applications using SPEC CFP2006 benchmarks , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[28]  Henk Corporaal,et al.  A modular and parameterisable classification of algorithms , 2011 .

[29]  Saturnino Garcia,et al.  Kremlin: like gprof, but for parallelization , 2011, PPoPP '11.

[30]  Benjamin C. Lee,et al.  Inferred Models for Dynamic and Sparse Hardware-Software Spaces , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[31]  Lieven Eeckhout,et al.  Comparing Benchmarks Using Key Microarchitecture-Independent Characteristics , 2006, 2006 IEEE International Symposium on Workload Characterization.

[32]  Margaret Martonosi,et al.  Starchart: Hardware and software optimization using recursive partitioning regression trees , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[33]  Kapil Vaswani,et al.  Construction and use of linear regression models for processor performance analysis , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[34]  Dmitry Mikushin,et al.  KERNELGEN – A Toolchain for Automatic GPU-centric Applications Porting , 2012 .

[35]  José Hernández-Orallo,et al.  From Ensemble Methods to Comprehensible Models , 2002, Discovery Science.

[36]  Peter D. Turney Technical note: Bias and the quantification of stability , 1995, Machine Learning.

[37]  Scott B. Baden,et al.  Modeling and predicting performance of high performance computing applications on hardware accelerators , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.

[38]  Kapil Vaswani,et al.  A Predictive Performance Model for Superscalar Processors , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[39]  Hendrik Blockeel,et al.  Seeing the Forest Through the Trees: Learning a Comprehensible Model from an Ensemble , 2007, ECML.

[40]  Wen-mei W. Hwu,et al.  CUDA-Lite: Reducing GPU Programming Complexity , 2008, LCPC.

[41]  Pradeep Dubey,et al.  Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU , 2010, ISCA.

[42]  Sudhakar Yalamanchili,et al.  Eiger: A framework for the automated synthesis of statistical performance models , 2012, 2012 19th International Conference on High Performance Computing.

[43]  David I. August,et al.  Automatic Parallelization for GPUs , 2013 .