Paper information - An Execution Model and Runtime for Heterogeneous Many Core Systems

An Execution Model and Runtime for Heterogeneous Many Core Systems

[1] Hong Jiang,et al. Pangaea: A tightly-coupled IA32 heterogeneous chip multiprocessor , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[2] Vikram S. Adve,et al. LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004.

[3] Vladimir M. Pentkovski,et al. Implementing Streaming SIMD Extensions on the Pentium III Processor , 2000, IEEE Micro.

[4] M. Horowitz,et al. The stream virtual machine , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004.

[5] Bradley C. Kuszmaul,et al. Cilk: an efficient multithreaded runtime system , 1995, PPOPP '95.

[6] Arie E. Kaufman,et al. GPU Cluster for High Performance Computing , 2004, Proceedings of the ACM/IEEE SC2004 Conference.

[7] Richard Johnson,et al. The Transmeta Code Morphing™ Software: using speculation, recovery, and adaptive retranslation to address real-life challenges , 2003, CGO.

[8] Edsger W. Dijkstra,et al. Termination Detection for Diffusing Computations , 1980, Inf. Process. Lett.

[9] Pat Hanrahan,et al. Brook for GPUs: stream computing on graphics hardware , 2004, SIGGRAPH 2004.

[10] Leslie G. Valiant,et al. A bridging model for parallel computation , 1990, CACM.

[11] D. Burger,et al. Billion-Transistor Architectures , 1997, Computer.

[12] Timothy Mattson,et al. A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[13] Noah Treuhaft,et al. Scalable Processors in the Billion-Transistor Era: IRAM , 1997, Computer.

[14] Sudhakar Yalamanchili,et al. A characterization and analysis of PTX kernels , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[15] Margaret Martonosi,et al. Characterizing and improving the performance of Intel Threading Building Blocks , 2008, 2008 IEEE International Symposium on Workload Characterization.

[16] J. Xu. OpenCL – The Open Standard for Parallel Programming of Heterogeneous Systems , 2009 .

[17] Scott A. Mahlke,et al. Orchestrating the execution of stream programs on multicore platforms , 2008, PLDI '08.

[18] Kunle Olukotun,et al. The case for a single-chip multiprocessor , 1996, ASPLOS VII.

[19] Bruce D'Amora,et al. High-performance server systems and the next generation of online games , 2006, IBM Syst. J.

[20] Pat Hanrahan,et al. Interactive k-d tree GPU raytracing , 2007, SI3D.

[21] Samuel Naffziger,et al. An x86-64 core implemented in 32nm SOI CMOS , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[22] David M. Brooks,et al. Accurate and efficient regression modeling for microarchitectural performance and power prediction , 2006, ASPLOS XII.

[23] John Gough,et al. Technical Overview of the Common Language Runtime , 2001 .

[24] Sam S. Stone,et al. MCUDA: An Efficient Implementation of CUDA Kernels on Multi-cores , 2011 .

[25] Sudhakar Yalamanchili,et al. Modeling GPU-CPU workloads and systems , 2010, GPGPU-3.

[26] Scott A. Mahlke,et al. MacroSS: macro-SIMDization of streaming applications , 2010, ASPLOS XV.

[27] Xin David Zhang,et al. A Streaming Computation Framework for the Cell Processor , 2007 .

[28] H. Spaanenburg,et al. Multi-core/tile Polymorphous Computing systems , 2008, 2008 1st International Conference on Information Technology.

[29] Rohit Chandra,et al. Parallel programming in OpenMP , 2000 .

[30] P. Sadayappan,et al. Optimal loop unrolling for GPGPU programs , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[31] Richard Johnson,et al. The Transmeta Code Morphing™ Software: using speculation, recovery, and adaptive retranslation to address real-life challenges , 2003, International Symposium on Code Generation and Optimization, 2003. CGO 2003.

[32] Pat Hanrahan,et al. Brook for GPUs: stream computing on graphics hardware , 2004, ACM Trans. Graph..

[33] Tor M. Aamodt,et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[34] Michael I. Gordon,et al. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs , 2006, ASPLOS XII.

[35] R. Hookway. DIGITAL FX!32 running 32-Bit x86 applications on Alpha NT , 1997, Proceedings IEEE COMPCON 97. Digest of Papers.

[36] Eric Darve,et al. N-Body simulation on GPUs , 2006, SC.

[37] Mike Murphy,et al. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs , 2010, CGO '10.

[38] Manish Vachharajani,et al. GPU acceleration of numerical weather prediction , 2008, IPDPS.

[39] Kevin Skadron,et al. Increasing memory miss tolerance for SIMD cores , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[40] Dean M. Tullsen,et al. Simultaneous multithreading: a platform for next-generation processors , 1997, IEEE Micro.

[41] Eric Rotenberg,et al. Architectural contesting: exposing and exploiting temperamental behavior , 2007, CARN.

[42] William Thies,et al. A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[43] Tom R. Halfhill. NVIDIA's Next-Generation CUDA Compute and Graphics Architecture, Code-Named Fermi, Adds Muscle for Parallel Processing , 2009 .

[44] Maurice Herlihy,et al. Transactional Memory: Architectural Support For Lock-free Data Structures , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.

[45] Jason Cong,et al. High-performance CUDA kernel execution on FPGAs , 2009, ICS.

[46] Vivek Sarkar,et al. The Jikes Research Virtual Machine project: Building an open-source research community , 2005, IBM Syst. J.

[47] Abhishek Udupa,et al. Software Pipelined Execution of Stream Programs on GPUs , 2009, 2009 International Symposium on Code Generation and Optimization.

[48] Kunle Olukotun,et al. Rationale, Design and Performance of the Hydra Multiprocessor , 1994 .

[49] Peter S. Pacheco. Parallel programming with MPI , 1996 .

[50] Trevor Mudge,et al. MacroSS: macro-SIMDization of streaming applications , 2010, ASPLOS 2010.

[51] James E. Smith,et al. Trace Processors: Moving to Fourth-Generation Microarchitectures , 1997, Computer.

[52] Gregory Diamos,et al. Exploring The Latency and Bandwidth Tolerance of CUDA Applications , 2011 .

[53] Richard W. Vuduc,et al. Direct N-body Kernels for Multicore Platforms , 2009, 2009 International Conference on Parallel Processing.

[54] Bradford Nichols,et al. Pthreads programming , 1996 .

[55] Yale N. Patt,et al. One Billion Transistors, One Uniprocessor, One Chip , 1997, Computer.

[56] Grigori Fursin,et al. Predictive Runtime Code Scheduling for Heterogeneous Architectures , 2008, HiPEAC.

[57] Lizy Kurian John,et al. Scaling to the end of silicon with EDGE architectures , 2004, Computer.

[58] Gurindar S. Sohi,et al. Multiscalar processors , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[59] Rudolf Eigenmann,et al. OpenMP to GPGPU: a compiler framework for automatic translation and optimization , 2009, PPoPP '09.

[60] Nicholas Nethercote,et al. Valgrind: a framework for heavyweight dynamic binary instrumentation , 2007, PLDI '07.

[61] Hyesoon Kim,et al. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[62] Sudhakar Yalamanchili,et al. Speculative execution on multi-GPU systems , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[63] Wolfgang Straßer,et al. Parallel volume rendering on a single-chip SIMD architecture , 2001, Proceedings IEEE 2001 Symposium on Parallel and Large-Data Visualization and Graphics (Cat. No.01EX520).

[64] Jostein R. Natvig,et al. Solving the Euler Equations on Graphics Processing Units , 2006, International Conference on Computational Science.

[65] Gregory Diamos,et al. Harmony: an execution model and runtime for heterogeneous many core systems , 2008, HPDC '08.

[66] Renu Vig,et al. Efficient Implementation of AES Algorithm in FPGA Device , 2007, International Conference on Computational Intelligence and Multimedia Applications (ICCIMA 2007).

[67] John D. Owens,et al. Message passing on data-parallel architectures , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[68] P. Hanrahan,et al. Sequoia: Programming the Memory Hierarchy , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[69] Yunfei Chen,et al. GPU accelerated molecular dynamics simulation of thermal conductivities , 2007, J. Comput. Phys.

[70] William Thies,et al. StreamIt: A Language for Streaming Applications , 2002, CC.

[71] K. Mani Chandy,et al. Termination Detection of Diffusing Computations in Communicating Sequential Processes , 1982, TOPL.

[72] Vivek Sarkar,et al. Baring it all to Software: The Raw Machine , 1997 .

[73] Laxmikant V. Kalé,et al. CHARM++: a portable concurrent object oriented system based on C++ , 1993, OOPSLA '93.