NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs

Binary instrumentation frameworks are widely used to implement profiling, performance evaluation, error checking, and bug detection tools. While dynamic binary instrumentation tools such as Pin and DynamoRIO are available for CPUs, GPU architectures currently have only limited support for similar capabilities, provided by static compile-time tools. This prevents instrumentation of the dynamically loaded libraries that underpin modern high-performance applications. This work presents NVBit, a fast, dynamic, and portable binary instrumentation framework that allows users to write instrumentation tools in CUDA/C/C++ and selectively apply that functionality to pre-compiled binaries and libraries executing on NVIDIA GPUs. Using dynamic recompilation at the SASS level, NVBit analyzes GPU kernel register requirements to generate efficient, ABI-compliant instrumented code without requiring the tool developer to have detailed knowledge of the underlying GPU architecture. NVBit allows basic-block instrumentation, multiple function injections at the same location, inspection of all ISA-visible state, dynamic selection of instrumented or uninstrumented code, permanent modification of register state, source code correlation, and instruction removal. NVBit supports all recent NVIDIA GPU architecture families, including Kepler, Maxwell, Pascal, and Volta, and works on any pre-compiled CUDA, OpenACC, OpenCL, or CUDA Fortran application.
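Two of the capabilities listed above, multiple function injections at the same location and dynamic selection of instrumented or uninstrumented code, can be illustrated with a small host-side toy model. This is only a sketch: the names `ToyInstr`, `ToyKernel`, `InsertPoint`, `insert_call`, and `launch` are hypothetical and are not part of the NVBit API, and the model runs on the CPU with strings standing in for SASS opcodes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy model only: none of these names belong to the NVBit API.
enum class InsertPoint { Before, After };

// A hook receives the opcode of the instruction it is attached to.
using Hook = std::function<void(const std::string&)>;

struct ToyInstr {
    std::string opcode;        // stand-in for a SASS opcode such as "LDG"
    std::vector<Hook> before;  // hooks injected before the instruction
    std::vector<Hook> after;   // hooks injected after the instruction
};

struct ToyKernel {
    std::vector<ToyInstr> instrs;
    bool run_instrumented = false;  // chosen at launch time, per kernel

    // Several hooks may be attached to the same instruction and point;
    // they run in insertion order.
    void insert_call(std::size_t idx, InsertPoint where, Hook fn) {
        assert(idx < instrs.size());
        std::vector<Hook>& hooks =
            (where == InsertPoint::Before) ? instrs[idx].before
                                           : instrs[idx].after;
        hooks.push_back(std::move(fn));
    }

    // "Executes" the kernel and returns the number of original
    // instructions that ran. Hooks fire only in the instrumented version.
    std::size_t launch() const {
        std::size_t executed = 0;
        for (const ToyInstr& in : instrs) {
            if (run_instrumented) {
                for (const Hook& fn : in.before) fn(in.opcode);
            }
            ++executed;  // the original instruction itself
            if (run_instrumented) {
                for (const Hook& fn : in.after) fn(in.opcode);
            }
        }
        return executed;
    }
};
```

In this sketch, flipping `run_instrumented` between launches switches between the original and the instrumented behavior without touching the instruction list, which mirrors the idea of keeping both code versions available and selecting one at run time.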
