CGPredict: Embedded GPU Performance Estimation from Single-Threaded Applications

Heterogeneous multiprocessor system-on-chip architectures include accelerators, such as embedded GPUs and FPGAs, that are capable of general-purpose computation. Application developers for such platforms need to choose carefully the accelerator that offers the greatest performance benefit. For a given application, the reference code is usually written in a high-level, single-threaded programming language such as C. The performance of an application kernel on an accelerator results from a complex interplay among the exposed parallelism, the compiler, and the accelerator architecture. Determining the performance of a kernel therefore requires redeveloping it in each accelerator-specific language, at a substantial cost in time and effort. To aid the developer in this early design decision, we present CGPredict, an analytical framework that predicts the performance of a computational kernel on an embedded GPU architecture from unoptimized, single-threaded C code. The analytical approach also provides insights into application characteristics, which suggest further application-specific optimizations. The estimation error is as low as 2.66% (9% on average) compared to the performance of the same kernel written in native CUDA code and run on an NVIDIA Kepler embedded GPU. This low estimation error enables CGPredict to provide an early design recommendation for the accelerator, starting from C code.
