An Execution Model and Runtime for Heterogeneous Many Core Systems

[1]  Hong Jiang,et al.  Pangaea: A tightly-coupled IA32 heterogeneous chip multiprocessor , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[2]  Vikram S. Adve,et al.  LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[3]  Vladimir M. Pentkovski,et al.  Implementing Streaming SIMD Extensions on the Pentium III Processor , 2000, IEEE Micro.

[4]  M. Horowitz,et al.  The stream virtual machine , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..

[5]  Bradley C. Kuszmaul,et al.  Cilk: an efficient multithreaded runtime system , 1995, PPOPP '95.

[6]  Arie E. Kaufman,et al.  GPU Cluster for High Performance Computing , 2004, Proceedings of the ACM/IEEE SC2004 Conference.

[7]  Richard Johnson,et al.  The Transmeta Code Morphing#8482; Software: using speculation, recovery, and adaptive retranslation to address real-life challenges , 2003, CGO.

[8]  Edsger W. Dijkstra,et al.  Termination Detection for Diffusing Computations , 1980, Inf. Process. Lett..

[9]  Pat Hanrahan,et al.  Brook for GPUs: stream computing on graphics hardware , 2004, SIGGRAPH 2004.

[10]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[11]  D. Burger,et al.  Billion-Transistor Architectures , 1997, Computer.

[12]  Timothy Mattson,et al.  A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[13]  Noah Treuhaft,et al.  Scalable Processors in the Billion-Transistor Era: IRAM , 1997, Computer.

[14]  Sudhakar Yalamanchili,et al.  A characterization and analysis of PTX kernels , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[15]  Margaret Martonosi,et al.  Characterizing and improving the performance of Intel Threading Building Blocks , 2008, 2008 IEEE International Symposium on Workload Characterization.

[16]  J. Xu OpenCL – The Open Standard for Parallel Programming of Heterogeneous Systems , 2009 .

[17]  Scott A. Mahlke,et al.  Orchestrating the execution of stream programs on multicore platforms , 2008, PLDI '08.

[18]  Kunle Olukotun,et al.  The case for a single-chip multiprocessor , 1996, ASPLOS VII.

[19]  Bruce D'Amora,et al.  High-performance server systems and the next generation of online games , 2006, IBM Syst. J..

[20]  Pat Hanrahan,et al.  Interactive k-d tree GPU raytracing , 2007, SI3D.

[21]  Samuel Naffziger,et al.  An x86-64 core implemented in 32nm SOI CMOS , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[22]  David M. Brooks,et al.  Accurate and efficient regression modeling for microarchitectural performance and power prediction , 2006, ASPLOS XII.

[23]  John Gough,et al.  Technical Overview of the Common Language Runtime , 2001 .

[24]  Sam S. Stone,et al.  MCUDA: An Efficient Implementation of CUDA Kernels on Multi-cores , 2011 .

[25]  Sudhakar Yalamanchili,et al.  Modeling GPU-CPU workloads and systems , 2010, GPGPU-3.

[26]  Scott A. Mahlke,et al.  MacroSS: macro-SIMDization of streaming applications , 2010, ASPLOS XV.

[27]  Xin David Zhang,et al.  A Streaming Computation Framework for the Cell Processor , 2007 .

[28]  H. Spaanenburg,et al.  Multi-core/tile Polymorphous Computing systems , 2008, 2008 1st International Conference on Information Technology.

[29]  Rohit Chandra,et al.  Parallel programming in openMP , 2000 .

[30]  P. Sadayappan,et al.  Optimal loop unrolling for GPGPU programs , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[31]  Richard Johnson,et al.  The Transmeta Code Morphing/spl trade/ Software: using speculation, recovery, and adaptive retranslation to address real-life challenges , 2003, International Symposium on Code Generation and Optimization, 2003. CGO 2003..

[32]  Pat Hanrahan,et al.  Brook for GPUs: stream computing on graphics hardware , 2004, ACM Trans. Graph..

[33]  Tor M. Aamodt,et al.  Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[34]  Michael I. Gordon,et al.  Exploiting coarse-grained task, data, and pipeline parallelism in stream programs , 2006, ASPLOS XII.

[35]  R. Hookway DIGITAL FX!32 running 32-Bit x86 applications on Alpha NT , 1997, Proceedings IEEE COMPCON 97. Digest of Papers.

[36]  Eric Darve,et al.  N-Body simulation on GPUs , 2006, SC.

[37]  Mike Murphy,et al.  Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs , 2010, CGO '10.

[38]  Manish Vachharajani,et al.  GPU acceleration of numerical weather prediction , 2008, IPDPS.

[39]  Kevin Skadron,et al.  Increasing memory miss tolerance for SIMD cores , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[40]  Dean M. Tullsen,et al.  Simultaneous multithreading: a platform for next-generation processors , 1997, IEEE Micro.

[41]  Eric Rotenberg,et al.  Architectural contesting: exposing and exploiting temperamental behavior , 2007, CARN.

[42]  William Thies,et al.  A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[43]  Tom R. Halfhill NVIDIA's Next-Generation CUDA Compute and Graphics Architecture, Code-Named Fermi, Adds Muscle for Parallel Processing , 2009 .

[44]  Maurice Herlihy,et al.  Transactional Memory: Architectural Support For Lock-free Data Structures , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.

[45]  Jason Cong,et al.  High-performance CUDA kernel execution on FPGAs , 2009, ICS.

[46]  Vivek Sarkar,et al.  The Jikes Research Virtual Machine project: Building an open-source research community , 2005, IBM Syst. J..

[47]  Abhishek Udupa,et al.  Software Pipelined Execution of Stream Programs on GPUs , 2009, 2009 International Symposium on Code Generation and Optimization.

[48]  Kunle Olukotun,et al.  Rationale, Design and Performance of the Hydra Multiprocessor , 1994 .

[49]  Peter S. Pacheco Parallel programming with MPI , 1996 .

[50]  Trevor Mudge,et al.  MacroSS: macro-SIMDization of streaming applications , 2010, ASPLOS 2010.

[51]  James E. Smith,et al.  Trace Processors: Moving to Fourth-Generation Microarchitectures , 1997, Computer.

[52]  Gregory Diamos,et al.  Exploring The Latency and Bandwidth Tolerance of CUDA Applications , 2011 .

[53]  Richard W. Vuduc,et al.  Direct N-body Kernels for Multicore Platforms , 2009, 2009 International Conference on Parallel Processing.

[54]  Bradford Nichols,et al.  Pthreads programming , 1996 .

[55]  Yale N. Patt,et al.  One Billion Transistors, One Uniprocessor, One Chip , 1997, Computer.

[56]  Grigori Fursin,et al.  Predictive Runtime Code Scheduling for Heterogeneous Architectures , 2008, HiPEAC.

[57]  Lizy Kurian John,et al.  Scaling to the end of silicon with EDGE architectures , 2004, Computer.

[58]  Gurindar S. Sohi,et al.  Multiscalar processors , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[59]  Rudolf Eigenmann,et al.  OpenMP to GPGPU: a compiler framework for automatic translation and optimization , 2009, PPoPP '09.

[60]  Nicholas Nethercote,et al.  Valgrind: a framework for heavyweight dynamic binary instrumentation , 2007, PLDI '07.

[61]  Hyesoon Kim,et al.  Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[62]  Sudhakar Yalamanchili,et al.  Speculative execution on multi-GPU systems , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[63]  Wolfgang Straßer,et al.  Parallel volume rendering on a single-chip SIMD architecture , 2001, Proceedings IEEE 2001 Symposium on Parallel and Large-Data Visualization and Graphics (Cat. No.01EX520).

[64]  Jostein R. Natvig,et al.  Solving the Euler Equations on Graphics Processing Units , 2006, International Conference on Computational Science.

[65]  Gregory Diamos,et al.  Harmony: an execution model and runtime for heterogeneous many core systems , 2008, HPDC '08.

[66]  Renu Vig,et al.  Efficient Implementation of AES Algorithm in FPGA Device , 2007, International Conference on Computational Intelligence and Multimedia Applications (ICCIMA 2007).

[67]  John D. Owens,et al.  Message passing on data-parallel architectures , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[68]  P. Hanrahan,et al.  Sequoia: Programming the Memory Hierarchy , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[69]  Yunfei Chen,et al.  GPU accelerated molecular dynamics simulation of thermal conductivities , 2007, J. Comput. Phys..

[70]  William Thies,et al.  StreamIt: A Language for Streaming Applications , 2002, CC.

[71]  K. Mani Chandy,et al.  Termination Detection of Diffusing Computations in Communicating Sequential Processes , 1982, TOPL.

[72]  Vivek Sarkar,et al.  Baring it all to Software: The Raw Machine , 1997 .

[73]  Laxmikant V. Kalé,et al.  CHARM++: a portable concurrent object oriented system based on C++ , 1993, OOPSLA '93.