Address Translation for Throughput-Oriented Accelerators

With processor vendors embracing hardware heterogeneity, providing low-overhead hardware and software abstractions that support easy-to-use programming models has become a critical problem. In this context, this work lays the foundation for designing memory management units (MMUs) for GPUs in CPU/GPU systems. MMUs are the key mechanism needed to support the increasingly important unified address space paradigm in heterogeneous systems.
