Paper Information - A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing

The high-performance computing landscape is shifting from collections of homogeneous nodes towards heterogeneous systems, in which nodes combine traditional out-of-order execution cores with accelerator devices. Accelerators, built around GPUs, many-core chips, FPGAs or DSPs, are used to offload compute-intensive tasks. The advent of this type of system has brought about a wide and diverse ecosystem of development platforms, optimization tools and performance analysis frameworks. This survey reviews the state of the art in performance tools for heterogeneous computing, focusing on the most popular families of accelerators: GPUs and Intel's Xeon Phi. We describe current heterogeneous systems and the development frameworks and tools available for programming them. The core of this survey is a review of the performance models and tools, including simulators, proposed in the literature for these platforms.
