Paper Information - Address Translation for Throughput-Oriented Accelerators

Address Translation for Throughput-Oriented Accelerators

With processor vendors embracing hardware heterogeneity, providing low-overhead hardware and software abstractions to support easy-to-use programming models is a critical problem. In this context, this work sets the foundation for designing memory management units (MMUs) for GPUs in CPU/GPU systems, the key mechanism necessary to support the increasingly important unified address space paradigm in heterogeneous systems.
