Automatic and Portable Mapping of Data Parallel Programs to OpenCL for GPU-Based Heterogeneous Systems
[1] Vivek Sarkar,et al. Languages and Compilers for Parallel Computing , 1994, Lecture Notes in Computer Science.
[2] J. Ross Quinlan,et al. C4.5: Programs for Machine Learning , 1992.
[3] J. Ramanujam,et al. A methodology for parallelizing programs for multicomputers and complex memory multiprocessors , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).
[4] Rudolf Eigenmann,et al. OpenMPC: Extended OpenMP Programming and Tuning for GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[5] Francky Catthoor,et al. Polyhedral parallel code generation for CUDA , 2013, TACO.
[6] Cédric Augonnet,et al. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures , 2011, Concurr. Comput. Pract. Exp..
[7] Michael F. P. O'Boyle,et al. Towards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mapping , 2009, PLDI '09.
[8] Michael F. P. O'Boyle,et al. Integrating profile-driven parallelism detection and machine-learning-based mapping , 2014, TACO.
[9] J. Ramanujam,et al. Automatic C-to-CUDA Code Generation for Affine Programs , 2010, CC.
[10] Wen-mei W. Hwu,et al. DL: A data layout transformation system for heterogeneous computing , 2012, 2012 Innovative Parallel Computing (InPar).
[11] Keith D. Cooper,et al. Optimizing for reduced code space using genetic algorithms , 1999, LCTES '99.
[12] Kevin Skadron,et al. Dymaxion: Optimizing memory access patterns for heterogeneous systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[13] Rudolf Eigenmann,et al. OpenMP to GPGPU: a compiler framework for automatic translation and optimization , 2009, PPoPP '09.
[14] Hyesoon Kim,et al. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[15] Peng Wu,et al. Vectorization for SIMD architectures with alignment constraints , 2004, PLDI '04.
[16] Samuel Williams,et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[17] Feng Liu,et al. Dynamically managed data for CPU-GPU architectures , 2012, CGO '12.
[18] Michael Wolfe,et al. Implementing the PGI Accelerator model , 2010, GPGPU-3.
[19] Lars Karlsson,et al. Blocked in-place transposition with application to storage format conversion , 2009.
[20] Michael F. P. O'Boyle,et al. Smart, adaptive mapping of parallelism in the presence of external workload , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[21] Bjarne Steensgaard,et al. Points-to analysis in almost linear time , 1996, POPL '96.
[22] Zheng Wang,et al. Automatic and Portable Mapping of Data Parallel Programs to OpenCL for GPU-Based Heterogeneous Systems , 2014.
[23] Jungwon Kim,et al. Achieving a single compute device image in OpenCL for multiple GPUs , 2011, PPoPP '11.
[24] Michael F. P. O'Boyle,et al. Portable and Transparent Host-Device Communication Optimization for GPGPU Environments , 2014, CGO '14.
[25] Mahmut T. Kandemir,et al. Neither more nor less: Optimizing thread-level parallelism for GPGPUs , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[26] Wen-mei W. Hwu,et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA , 2008, PPoPP.
[27] Michael F. P. O'Boyle,et al. OpenCL Task Partitioning in the Presence of GPU Contention , 2013, LCPC.
[28] Richard W. Vuduc,et al. Many-Thread Aware Prefetching Mechanisms for GPGPU Applications , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.
[29] Jeff S. Brantley,et al. Contention-Aware Scheduling of Parallel Code for Heterogeneous Systems , 2010.
[30] Lars Karlsson,et al. Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion , 2012, TOMS.
[31] Scott A. Mahlke,et al. Sponge: portable stream programming on graphics engines , 2011, ASPLOS XVI.
[32] Michael F. P. O'Boyle,et al. Smart multi-task scheduling for OpenCL programs on CPU/GPU heterogeneous platforms , 2014, 2014 21st International Conference on High Performance Computing (HiPC).
[33] Michael F. P. O'Boyle,et al. Partitioning streaming parallelism for multi-cores: A machine learning based approach , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[34] Keith D. Cooper,et al. Register promotion in C programs , 1997, PLDI '97.
[35] Zheng Wang,et al. Fast Automatic Heuristic Construction Using Active Learning , 2014, LCPC.
[36] Michael F. P. O'Boyle,et al. A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL , 2011, CC.
[37] Collin McCurdy,et al. The Scalable Heterogeneous Computing (SHOC) benchmark suite , 2010, GPGPU-3.
[38] Ronan Keryell,et al. Par4All: From Convex Array Regions to Heterogeneous Computing , 2012, HiPEAC 2012.
[39] Hugh Leather,et al. MaSiF: Machine learning guided auto-tuning of parallel skeletons , 2013, 20th Annual International Conference on High Performance Computing.
[40] Jaejin Lee,et al. Performance characterization of the NAS Parallel Benchmarks in OpenCL , 2011, 2011 IEEE International Symposium on Workload Characterization (IISWC).
[41] David I. August,et al. Automatic CPU-GPU communication management and optimization , 2011, PLDI '11.
[42] Michael F. P. O'Boyle,et al. Using machine learning to partition streaming programs , 2013, ACM Trans. Archit. Code Optim..
[43] Mike Murphy,et al. CUDA: Compiling and optimizing for a GPU platform , 2012, ICCS.
[44] Barbara M. Chapman,et al. Exploiting global optimizations for openmp programs in the openuh compiler , 2009, PPoPP '09.
[45] Björn Franke,et al. Integrating profile-driven parallelism detection and machine-learning-based mapping , 2014.
[46] Wen-mei W. Hwu,et al. CUDA-Lite: Reducing GPU Programming Complexity , 2008, LCPC.
[47] Björn Franke,et al. Towards a holistic approach to auto-parallelization , 2009.
[48] D. Woolley. The White Paper. , 1972, British medical journal.
[49] D. K. Arvind,et al. Languages and Compilers for Parallel Computing , 2014, Lecture Notes in Computer Science.
[50] Michael F. P. O'Boyle,et al. A workload-aware mapping approach for data-parallel programs , 2011, HiPEAC.
[51] Michael F. P. O'Boyle,et al. Exploitation of GPUs for the Parallelisation of Probably Parallel Legacy Code , 2014, CC.
[52] Juan Gómez-Luna,et al. In-place transposition of rectangular matrices on accelerators , 2014, PPoPP '14.
[53] Yi Yang,et al. A GPGPU compiler for memory optimization and parallelism management , 2010, PLDI '10.
[54] Wen-mei W. Hwu,et al. Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications , 2010, International Journal of Parallel Programming.
[55] Michael F. P. O'Boyle,et al. Mapping parallelism to multi-cores: a machine learning based approach , 2009, PPoPP '09.
[56] Uday Bondhugula,et al. Believe it or not! multi-core CPUs can match GPU performance for a FLOP-intensive application! , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[57] Michael F. P. O'Boyle,et al. Portable mapping of data parallel programs to OpenCL for heterogeneous systems , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[58] Pradeep Dubey,et al. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU , 2010, ISCA.
[59] Richard W. Vuduc,et al. A performance analysis framework for identifying potential benefits in GPGPU applications , 2012, PPoPP '12.
[60] Mike Murphy,et al. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs , 2010, CGO '10.