Translating GPU Binaries to Tiered SIMD Architectures with Ocelot

Parallel Thread Execution ISA (PTX) is a virtual instruction set used by NVIDIA GPUs that explicitly expresses hierarchical MIMD- and SIMD-style parallelism in an application. In such a programming model, the programmer and compiler are left with the nontrivial but tractable task of composing applications from parallel algorithms and data structures. Once this has been accomplished, even simple architectures with low hardware complexity can readily exploit the parallelism in an application. With these applications in mind, this paper presents Ocelot, a binary translation framework designed to allow architectures other than NVIDIA GPUs to leverage the parallelism in PTX programs. Specifically, we show how (i) the PTX thread hierarchy can be mapped to many-core architectures, (ii) translation techniques can be used to hide memory latency, and (iii) GPU data structures can be efficiently emulated or mapped to native equivalents. We describe the low-level implementation of our translator and conclude with a case study detailing the complete translation process from PTX to SPU assembly for the IBM Cell processor.
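
As an illustration of point (i), the sketch below shows one common way a two-level thread hierarchy can be mapped onto a many-core host: each cooperative thread array (CTA) is assigned to one worker, and the threads inside a CTA are serialized into loops that are split at each barrier. This is a minimal sketch under stated assumptions, not Ocelot's actual translation scheme, which the abstract does not specify. The names `run_cta` and `launch`, the reversal kernel, and the use of a Python thread pool as a stand-in for cores are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_cta(cta_id, cta_size, src, dst):
    """Run one CTA on one worker by serializing its threads (hypothetical helper)."""
    shared = [0] * cta_size      # stands in for per-CTA shared memory
    base = cta_id * cta_size     # first global index owned by this CTA

    # Phase 1: the code before the barrier, executed for every thread id.
    for tid in range(cta_size):
        shared[tid] = src[base + tid]

    # The barrier becomes the boundary between the two loops: phase 1 has
    # finished for all threads before any thread starts phase 2.
    for tid in range(cta_size):
        dst[base + tid] = shared[cta_size - 1 - tid]


def launch(num_ctas, cta_size, src, num_workers=4):
    """Distribute CTAs over workers; each CTA writes a disjoint slice of dst."""
    dst = [0] * len(src)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(lambda c: run_cta(c, cta_size, src, dst), range(num_ctas)))
    return dst


# Example kernel: reverse the elements within each CTA.
result = launch(num_ctas=2, cta_size=4, src=list(range(8)))
print(result)  # [3, 2, 1, 0, 7, 6, 5, 4]
```

CTAs are independent of one another in this model, so they can be scheduled on workers in any order, while the barrier-delimited loops preserve the ordering that threads within a CTA rely on.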