Address Translation for Throughput-Oriented Accelerators
Abhishek Bhattacharjee | Lisa R. Hsu | Bharath Pichai
[1] Mike O'Connor,et al. Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems , 2012, 2012 IEEE International Symposium on Performance Analysis of Systems & Software.
[2] Daniel J. Sorin,et al. Evaluating cache coherent shared virtual memory for heterogeneous multicore chips , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[3] Karthikeyan Sankaralingam,et al. iGPU: Exception support and speculative execution on GPUs , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).
[4] Norman P. Jouppi,et al. CACTI 6.0: A Tool to Model Large Caches , 2009.
[5] Abhishek Bhattacharjee,et al. Large-reach memory management unit caches , 2013, MICRO.
[6] Alan L. Cox,et al. Practical, transparent operating system support for superpages , 2002, OPSR.
[7] David A. Wood,et al. Supporting x86-64 address translation for 100s of GPU lanes , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[8] Sang Lyul Min,et al. U-cache: a cost-effective solution to synonym problem , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.
[9] Hyesoon Kim. Supporting virtual memory in GPGPU without supporting precise exceptions , 2012, MSPC '12.
[10] Margaret Martonosi,et al. Inter-Core Cooperative TLB Prefetchers for Chip Multiprocessors , 2010, ASPLOS 2010.
[11] Michael M. Swift,et al. Efficient virtual memory for big memory servers , 2013, ISCA.
[12] Aamer Jaleel,et al. CoLT: Coalesced Large-Reach TLBs , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[13] Mike O'Connor,et al. Cache-Conscious Wavefront Scheduling , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[14] Muli Ben-Yehuda,et al. rIOMMU: Efficient IOMMU for I/O Devices that Employ Ring Buffers , 2015, ASPLOS.
[15] Tor M. Aamodt,et al. Thread block compaction for efficient SIMT control flow , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[16] Margaret Martonosi,et al. Shared last-level TLBs for chip multiprocessors , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[17] Aamer Jaleel,et al. In-line interrupt handling for software-managed TLBs , 2001, Proceedings 2001 IEEE International Conference on Computer Design: VLSI in Computers and Processors. ICCD 2001.
[18] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[19] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[20] Abhishek Bhattacharjee,et al. Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces , 2014, ASPLOS.
[21] Alan L. Cox,et al. SpecTLB: A mechanism for speculative address translation , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[22] Abhishek Bhattacharjee,et al. Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces , 2013, ASPLOS.
[23] Mike O'Connor,et al. Cache coherence for GPU architectures , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).
[24] Muli Ben-Yehuda,et al. IOMMU: strategies for mitigating the IOTLB bottleneck , 2010, ISCA'10.
[25] Gil Neiger,et al. Intel® Virtualization Technology for Directed I/O , 2006.