A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing

The high-performance computing landscape is shifting from collections of homogeneous nodes toward heterogeneous systems, in which each node combines traditional out-of-order execution cores with accelerator devices. Accelerators, built around GPUs, many-core chips, FPGAs, or DSPs, are used to offload compute-intensive tasks. The advent of such systems has brought about a wide and diverse ecosystem of development platforms, optimization tools, and performance analysis frameworks. This survey reviews the state of the art in performance tools for heterogeneous computing, focusing on the most popular families of accelerators: GPUs and Intel's Xeon Phi. We describe current heterogeneous systems and the development frameworks and tools available for programming them. The core of the survey is a review of the performance models and tools, including simulators, proposed in the literature for these platforms.
