Enable back memory and global synchronization on LLC buffer