The Design and Implementation of Ocelot’s Dynamic Binary Translator from PTX to Multi-Core x86
[1] Pedro López,et al. Anaphase: A Fine-Grain Thread Decomposition Scheme for Speculative Multithreading , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.
[2] Andrew Kerr,et al. Translating GPU Binaries to Tiered SIMD Architectures with Ocelot , 2009 .
[3] David Parello,et al. Barra, a Modular Functional GPU Simulator for GPGPU , 2009 .
[4] Vikram S. Adve,et al. LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004.
[5] Gregory Diamos,et al. Harmony: an execution model and runtime for heterogeneous many core systems , 2008, HPDC '08.
[6] Sam S. Stone,et al. MCUDA: An Efficient Implementation of CUDA Kernels on Multi-cores , 2011 .
[7] William J. Dally,et al. Smart Memories: a modular reconfigurable architecture , 2000, ISCA '00.
[8] Tom R. Halfhill. NVIDIA's Next-Generation CUDA Compute and Graphics Architecture, Code-Named Fermi, Adds Muscle for Parallel Processing , 2009 .
[9] Vasanth Bala,et al. Dynamo: a transparent dynamic optimization system , 2000, SIGP.
[10] Rudolf Eigenmann,et al. OpenMP to GPGPU: a compiler framework for automatic translation and optimization , 2009, PPoPP '09.
[11] Nicholas Nethercote,et al. Valgrind: a framework for heavyweight dynamic binary instrumentation , 2007, PLDI '07.
[12] Scott A. Mahlke,et al. Liquid SIMD: Abstracting SIMD Hardware using Lightweight Dynamic Mapping , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.
[13] R. Hookway. DIGITAL FX!32 running 32-Bit x86 applications on Alpha NT , 1997, Proceedings IEEE COMPCON 97. Digest of Papers.
[14] Cheng Wang,et al. StarDBT: An Efficient Multi-platform Dynamic Binary Translation System , 2007, Asia-Pacific Computer Systems Architecture Conference.
[15] Mike Murphy,et al. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs , 2010, CGO '10.
[16] Simha Sethumadhavan,et al. Distributed Microarchitectural Protocols in the TRIPS Prototype Processor , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).
[17] Sudhakar Yalamanchili,et al. A characterization and analysis of PTX kernels , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[18] Margaret Martonosi,et al. Characterizing and improving the performance of Intel Threading Building Blocks , 2008, 2008 IEEE International Symposium on Workload Characterization.
[19] Jason Cong,et al. High-performance CUDA kernel execution on FPGAs , 2009, ICS.
[20] Vivek Sarkar,et al. The Jikes Research Virtual Machine project: Building an open-source research community , 2005, IBM Syst. J.
[21] Richard Johnson,et al. The Transmeta Code Morphing™ Software: using speculation, recovery, and adaptive retranslation to address real-life challenges , 2003, International Symposium on Code Generation and Optimization, 2003. CGO 2003.
[22] Leslie G. Valiant,et al. A bridging model for parallel computation , 1990, CACM.
[23] Owen Liu. AMD technology: power, performance and the future , 2007, China HPC.
[24] Tor M. Aamodt,et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[25] John D. Owens,et al. Message passing on data-parallel architectures , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[26] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[27] Sanjay J. Patel,et al. Rigel: an architecture and scalable programming interface for a 1000-core accelerator , 2009, ISCA '09.
[28] Henry Hoffmann,et al. Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004.
[29] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[30] Naga K. Govindaraju,et al. Mars: A MapReduce Framework on graphics processors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[31] Vijay Janapa Reddi,et al. PIN: a binary instrumentation tool for computer architecture research and education , 2004, WCAE '04.