The Design and Implementation of Ocelot’s Dynamic Binary Translator from PTX to Multi-Core x86

Ocelot is a dynamic compilation framework designed to map the explicitly parallel PTX execution model used by NVIDIA CUDA applications onto diverse many-core architectures. Ocelot includes a dynamic binary translator from PTX to many-core processors that leverages the LLVM code generator to target x86. The binary translator executes CUDA applications without recompilation, and Ocelot can in fact dynamically switch between execution on an NVIDIA GPU and a many-core CPU. It has been validated against over 100 applications taken from the CUDA SDK [1], the UIUC Parboil benchmarks [2], the Virginia Rodinia benchmarks [3], the GPUVSIPL signal and image processing library [4], and several domain-specific applications. This paper presents a detailed description of the implementation of our binary translator, highlighting design decisions and trade-offs and showcasing their effect on application performance. We explore several code transformations that are applicable only when translating explicitly parallel applications and suggest additional optimization passes that may be useful to this class of applications. We expect this study to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.
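
To make the translation problem concrete, the following sketch illustrates the general idea of executing a fine-grained SPMD kernel on a CPU by serializing the CUDA thread hierarchy into explicit loops, in the spirit of MCUDA-style thread-loop translation [6]. It is an illustrative example only, with hypothetical names (saxpy, saxpy_cpu), not code emitted by Ocelot.

// Illustrative sketch only (not Ocelot's generated code): each CUDA thread
// handles one element of a SAXPY computation.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Conceptual CPU-side equivalent: the implicit (block, thread) grid becomes
// explicit loops, and the kernel body runs once per logical CUDA thread.
// A real translator must also handle barriers, shared memory, and divergent
// control flow, which this sketch omits.
void saxpy_cpu(int n, float a, const float* x, float* y,
               int gridDimX, int blockDimX) {
    for (int block = 0; block < gridDimX; ++block) {
        for (int thread = 0; thread < blockDimX; ++thread) {
            int i = block * blockDimX + thread;
            if (i < n) y[i] = a * x[i] + y[i];
        }
    }
}

In Ocelot itself, this mapping is performed on PTX and lowered through the LLVM code generator to x86 rather than at the CUDA source level.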

[1] Pedro López et al. Anaphase: A Fine-Grain Thread Decomposition Scheme for Speculative Multithreading. 18th International Conference on Parallel Architectures and Compilation Techniques, 2009.

[2] Andrew Kerr et al. Translating GPU Binaries to Tiered SIMD Architectures with Ocelot, 2009.

[3] David Parello et al. Barra, a Modular Functional GPU Simulator for GPGPU, 2009.

[4] Vikram S. Adve et al. LLVM: a compilation framework for lifelong program analysis & transformation. International Symposium on Code Generation and Optimization (CGO), 2004.

[5] Gregory Diamos et al. Harmony: an execution model and runtime for heterogeneous many core systems. HPDC '08, 2008.

[6] Sam S. Stone et al. MCUDA: An Efficient Implementation of CUDA Kernels on Multi-cores, 2011.

[7] William J. Dally et al. Smart Memories: a modular reconfigurable architecture. ISCA '00, 2000.

[8] Tom R. Halfhill. NVIDIA's Next-Generation CUDA Compute and Graphics Architecture, Code-Named Fermi, Adds Muscle for Parallel Processing, 2009.

[9] Vasanth Bala et al. Dynamo: a transparent dynamic optimization system. PLDI '00, 2000.

[10] Rudolf Eigenmann et al. OpenMP to GPGPU: a compiler framework for automatic translation and optimization. PPoPP '09, 2009.

[11] Nicholas Nethercote et al. Valgrind: a framework for heavyweight dynamic binary instrumentation. PLDI '07, 2007.

[12] Scott A. Mahlke et al. Liquid SIMD: Abstracting SIMD Hardware using Lightweight Dynamic Mapping. IEEE 13th International Symposium on High Performance Computer Architecture, 2007.

[13] R. Hookway. DIGITAL FX!32 running 32-Bit x86 applications on Alpha NT. Proceedings of IEEE COMPCON 97, Digest of Papers, 1997.

[14] Cheng Wang et al. StarDBT: An Efficient Multi-platform Dynamic Binary Translation System. Asia-Pacific Computer Systems Architecture Conference, 2007.

[15] Mike Murphy et al. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs. CGO '10, 2010.

[16] Simha Sethumadhavan et al. Distributed Microarchitectural Protocols in the TRIPS Prototype Processor. 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2006.

[17] Sudhakar Yalamanchili et al. A characterization and analysis of PTX kernels. IEEE International Symposium on Workload Characterization (IISWC), 2009.

[18] Margaret Martonosi et al. Characterizing and improving the performance of Intel Threading Building Blocks. IEEE International Symposium on Workload Characterization (IISWC), 2008.

[19] Jason Cong et al. High-performance CUDA kernel execution on FPGAs. ICS, 2009.

[20] Vivek Sarkar et al. The Jikes Research Virtual Machine project: Building an open-source research community. IBM Systems Journal, 2005.

[21] Richard Johnson et al. The Transmeta Code Morphing™ Software: using speculation, recovery, and adaptive retranslation to address real-life challenges. International Symposium on Code Generation and Optimization (CGO), 2003.

[22] Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, 1990.

[23] Owen Liu. AMD technology: power, performance and the future. China HPC, 2007.

[24] Tor M. Aamodt et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2007.

[25] John D. Owens et al. Message passing on data-parallel architectures. IEEE International Symposium on Parallel & Distributed Processing, 2009.

[26] Kevin Skadron et al. Rodinia: A benchmark suite for heterogeneous computing. IEEE International Symposium on Workload Characterization (IISWC), 2009.

[27] Sanjay J. Patel et al. Rigel: an architecture and scalable programming interface for a 1000-core accelerator. ISCA '09, 2009.

[28] Henry Hoffmann et al. Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams. 31st Annual International Symposium on Computer Architecture, 2004.

[29] Henry Wong et al. Analyzing CUDA workloads using a detailed GPU simulator. IEEE International Symposium on Performance Analysis of Systems and Software, 2009.

[30] Naga K. Govindaraju et al. Mars: A MapReduce Framework on graphics processors. International Conference on Parallel Architectures and Compilation Techniques (PACT), 2008.

[31] Vijay Janapa Reddi et al. PIN: a binary instrumentation tool for computer architecture research and education. WCAE '04, 2004.