SemCache: semantics-aware caching for efficient GPU offloading
[1] Of references. , 1966, JAMA.
[2] Bill Nitzberg,et al. Distributed shared memory: a survey of issues and algorithms , 1991, Computer.
[3] Alan L. Cox,et al. TreadMarks: shared memory computing on networks of workstations , 1996 .
[4] Jack Dongarra,et al. LAPACK Users' Guide, 3rd ed. , 1999 .
[5] Yves Robert,et al. A Proposal for a Heterogeneous Cluster ScaLAPACK (Dense Linear Solvers) , 2001, IEEE Trans. Computers.
[6] A. Prakash,et al. A FETI‐based multi‐time‐step coupling method for Newmark schemes in structural dynamics , 2004 .
[7] Tamara G. Kolda,et al. An overview of the Trilinos project , 2005, TOMS.
[8] H. Peter Hofstee,et al. Introduction to the Cell multiprocessor , 2005, IBM J. Res. Dev.
[9] Teresa H. Y. Meng,et al. Merge: a programming model for heterogeneous multi-core systems , 2008, ASPLOS.
[10] Eduard Ayguadé,et al. Hybrid access-specific software cache techniques for the cell BE architecture , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[11] Samuel Williams,et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[12] James Demmel,et al. Benchmarking GPUs to tune dense linear algebra , 2008, HiPC 2008.
[13] Robert A. van de Geijn,et al. Solving dense linear systems on platforms with multiple hardware accelerators , 2009, PPoPP '09.
[14] Eduard Ayguadé,et al. An Extension of the StarSs Programming Model for Platforms with Multiple GPUs , 2009, Euro-Par.
[15] Hyesoon Kim,et al. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[16] John E. Stone,et al. An asymmetric distributed shared memory model for heterogeneous parallel systems , 2010, ASPLOS 2010.
[17] John E. Stone,et al. An asymmetric distributed shared memory model for heterogeneous parallel systems , 2010, ASPLOS XV.
[18] Robert A. van de Geijn,et al. Retargeting PLAPACK to clusters with hardware accelerators , 2010, 2010 International Conference on High Performance Computing & Simulation.
[19] Eric J. Kelmelis,et al. CULA: hybrid GPU accelerated linear algebra routines , 2010, Defense + Commercial Sensing.
[20] David I. August,et al. Automatic CPU-GPU communication management and optimization , 2011, PLDI '11.
[21] Bronis R. de Supinski,et al. OpenMP for Accelerators , 2011, IWOMP.
[22] Mark Silberstein,et al. PTask: operating system abstractions to manage GPUs as compute devices , 2011, SOSP.
[23] Cédric Augonnet,et al. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures , 2011, Concurr. Comput. Pract. Exp..
[24] Jungwon Kim,et al. Achieving a single compute device image in OpenCL for multiple GPUs , 2011, PPoPP '11.
[25] Anand Raghunathan,et al. MDR: performance model driven runtime for heterogeneous parallel platforms , 2011, ICS '11.
[26] Arun Chauhan,et al. Automating GPU computing in MATLAB , 2011, ICS '11.
[27] Mickeal Verschoor,et al. Analysis and performance estimation of the Conjugate Gradient method on multiple GPUs , 2012, Parallel Comput..
[28] Martin Uecker,et al. A Multi-GPU Programming Library for Real-Time Applications , 2012, ICA3PP.
[29] Andrew S. Grimshaw,et al. Scalable GPU graph traversal , 2012, PPoPP '12.
[30] Feng Liu,et al. Dynamically managed data for CPU-GPU architectures , 2012, CGO '12.
[31] R. Govindarajan,et al. Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[32] Jack J. Dongarra,et al. Enabling and scaling matrix computations on heterogeneous multi-core and multi-GPU systems , 2012, ICS '12.
[33] Laxmikant V. Kalé,et al. G-Charm: an adaptive runtime system for message-driven parallel applications on hybrid systems , 2013, ICS '13.
[34] Arun Prakash,et al. Exploiting domain knowledge to optimize parallel computational mechanics codes , 2013, ICS '13.
[35] Daniel Sunderland,et al. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns , 2014, J. Parallel Distributed Comput..
[36] Thomas F. Wenisch,et al. Unlocking bandwidth for GPUs in CC-NUMA systems , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[37] Yi-Ping You,et al. VirtCL: a framework for OpenCL device abstraction and management , 2015, PPoPP.
[38] Milind Kulkarni,et al. SemCache++: Semantics-Aware Caching for Efficient Multi-GPU Offloading , 2015, ICS.