Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
[1] Gary A. Kildall,et al. A unified approach to global program optimization , 1973, POPL.
[2] Constantine D. Polychronopoulos,et al. Fast barrier synchronization hardware , 1990, Proceedings SUPERCOMPUTING '90.
[3] Mark N. Wegman,et al. Efficiently computing static single assignment form and the control dependence graph , 1991, TOPLAS.
[4] Alvin M. Despain,et al. Cache design trade-offs for power and performance optimization: a case study , 1995, ISLPED '95.
[5] R. P. Colwell,et al. A 0.6 μm BiCMOS processor with dynamic execution , 1995, Proceedings ISSCC '95 - International Solid-State Circuits Conference.
[6] Robert S. Cohn,et al. Hot cold optimization of large Windows/NT applications , 1996, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29.
[7] James E. Smith,et al. Complexity-Effective Superscalar Processors , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.
[8] Alex Rapaport,et al. MPI-2: extensions to the message-passing interface , 1997 .
[9] John Yates,et al. FX!32 a profile-directed binary translator , 1998, IEEE Micro.
[10] S. F. Hummel,et al. Implementing Jalapeño in Java , 1999 .
[11] James R. Larus,et al. Cache-conscious structure definition , 1999, PLDI '99.
[12] Rohit Chandra,et al. Parallel programming in OpenMP , 2000 .
[13] Michael Hind,et al. Which pointer analysis should I use? , 2000, ISSTA '00.
[14] Michael Gschwind,et al. Dynamic Binary Translation and Optimization , 2001, IEEE Trans. Computers.
[15] Kenneth Moreland,et al. The FFT on a GPU , 2003, HWWS '03.
[16] Pat Conway,et al. The AMD Opteron Processor for Multiprocessor Servers , 2003, IEEE Micro.
[17] Chris Lattner,et al. Data Structure Analysis: A Fast and Scalable Context-Sensitive Heap Analysis , 2003 .
[18] Benjamin C. Pierce,et al. Advanced Topics In Types And Programming Languages , 2004 .
[19] Vikram S. Adve,et al. LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..
[20] Pat Hanrahan,et al. Ray tracing on programmable graphics hardware , 2002, SIGGRAPH Courses.
[21] Dinesh Manocha,et al. Fast computation of database operations using graphics processors , 2005, SIGGRAPH Courses.
[22] Rüdiger Westermann,et al. Linear algebra operators for GPU implementation of numerical algorithms , 2003, SIGGRAPH Courses.
[23] Harish Patil,et al. Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.
[24] H. Peter Hofstee,et al. Introduction to the Cell multiprocessor , 2005, IBM J. Res. Dev..
[25] Carla Schlatter Ellis,et al. Algorithms for parallel memory allocation , 1989, International Journal of Parallel Programming.
[26] P. Hanrahan,et al. Sequoia: Programming the Memory Hierarchy , 2006, ACM/IEEE SC 2006 Conference (SC'06).
[27] D. Campbell. VSIPL++ Acceleration Using Commodity Graphics Processors , 2006, 2006 HPCMP Users Group Conference (HPCMP-UGC'06).
[28] Eric Darve,et al. N-Body simulation on GPUs , 2006, SC.
[29] Bo Han,et al. Efficient video decoding on GPUs by point based rendering , 2006, GH '06.
[30] Timothy Johnson,et al. An 8-core, 64-thread, 64-bit power efficient SPARC SoC (Niagara2) , 2007, ISPD '07.
[31] Tor M. Aamodt,et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[32] Ian Buck,et al. GPU computing with NVIDIA CUDA , 2007, SIGGRAPH Courses.
[33] Sanjay J. Patel,et al. Implicitly Parallel Programming Models for Thousand-Core Microprocessors , 2007, 2007 44th ACM/IEEE Design Automation Conference.
[34] Yunfei Chen,et al. GPU accelerated molecular dynamics simulation of thermal conductivities , 2007, J. Comput. Phys..
[35] Yao Zhang,et al. Scan primitives for GPU computing , 2007, GH '07.
[36] Teresa H. Y. Meng,et al. Merge: a programming model for heterogeneous multi-core systems , 2008, ASPLOS.
[37] Ramani Duraiswami,et al. Fast multipole methods on graphics processors , 2008, J. Comput. Phys..
[38] Bingsheng He,et al. Relational joins on graphics processors , 2008, SIGMOD Conference.
[39] Wen-mei W. Hwu,et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA , 2008, PPoPP.
[40] Gregory Diamos,et al. Harmony: an execution model and runtime for heterogeneous many core systems , 2008, HPDC '08.
[41] Philippas Tsigas,et al. A Practical Quicksort Algorithm for Graphics Processors , 2008, ESA.
[42] Wilson W. L. Fung,et al. Dynamic warp formation: exploiting thread scheduling for efficient MIMD control flow on SIMD graphics hardware , 2008 .
[43] Patrick Horain,et al. GpuCV: an opensource GPU-accelerated framework for image processing and computer vision , 2008, ACM Multimedia.
[44] Fang Liu,et al. Characterizing and modeling the behavior of context switch misses , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[45] Edward T. Grochowski,et al. Larrabee: A many-Core x86 architecture for visual computing , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).
[46] Vasanth Bala,et al. Dynamo: a transparent dynamic optimization system , 2000, SIGP.
[47] Sam S. Stone,et al. MCUDA: An Efficient Implementation of CUDA Kernels on Multi-cores , 2011 .