Fast Computational GPU Design with GT-Pin

As computational applications become common for graphics processing units, new hardware designs must be developed to meet the unique needs of these workloads. Performance simulation is an important step in appraising how well a candidate design will serve these needs, but unfortunately, computational GPU programs are so large that simulating them in detail is prohibitively slow. This work addresses the need to understand very large computational GPU programs in three ways. First, it introduces a fast tracing tool that uses binary instrumentation for in-depth analyses of native executions on existing architectures. Second, it characterizes 25 commercial and benchmark OpenCL applications, which average 308 billion GPU instructions apiece and are by far the largest benchmarks that have been natively profiled at this level of detail. Third, it accelerates simulation of future hardware by pinpointing small subsets of OpenCL applications that can be simulated as representative surrogates in lieu of full-length programs. Our fast selection method requires no simulation itself and allows the user to navigate the accuracy/simulation speed trade-off space, from extremely accurate with reasonable speedups (35X increase in simulation speed for 0.3% error) to reasonably accurate with extreme speedups (223X simulation speedup for 3.0% error).
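The abstract does not say how the representative subsets are chosen, so the following is only a minimal sketch of the general idea, assuming a SimPoint-style approach. An execution is split into intervals, each interval is summarized by a feature vector taken from native profiling, similar intervals are grouped, and one interval per group is simulated with a weight equal to the group's share of dynamic instructions. The function names, the greedy grouping rule, and the `threshold` parameter are all hypothetical illustrations, not the authors' method or GT-Pin's API.

```python
from math import sqrt


def normalize(vec):
    """Scale a feature vector (e.g. basic-block counts) so its entries sum to 1."""
    total = sum(vec)
    return [v / total for v in vec] if total else list(vec)


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_representatives(features, instr_counts, threshold):
    """Greedy grouping (an assumed, simplified stand-in for real clustering).

    Each interval joins the first existing representative whose normalized
    features lie within `threshold`; otherwise it becomes a new representative.
    Returns {representative_index: weight}, where weight is the fraction of
    all dynamic instructions covered by that representative's group.
    """
    total = sum(instr_counts)
    norm = [normalize(f) for f in features]
    weights = {}
    for i, f in enumerate(norm):
        for r in weights:
            if distance(f, norm[r]) <= threshold:
                weights[r] += instr_counts[i] / total
                break
        else:
            weights[i] = instr_counts[i] / total
    return weights


def extrapolate(weights, per_instruction_metric):
    """Estimate a whole-program per-instruction metric from representatives only."""
    return sum(w * per_instruction_metric[r] for r, w in weights.items())
```

In this sketch a larger `threshold` produces fewer representatives, which means less to simulate but a coarser estimate. That is one way to picture the accuracy versus simulation-speed trade-off the abstract describes. Only the intervals returned by `select_representatives` would need detailed simulation, and `extrapolate` combines their results.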
