Supporting Address Translation for Accelerator-Centric Architectures

While emerging accelerator-centric architectures offer orders-of-magnitude performance and energy improvements, their use cases and adoption can be limited by a rigid programming model. A unified virtual address space between the host CPU cores and customized accelerators can greatly improve programmability, but it necessitates hardware support for address translation. However, supporting address translation for customized accelerators with low overhead is nontrivial. Prior studies either assume an infinite-sized TLB and zero page-walk latency, or rely on a slow IOMMU for correctness and safety, which penalizes overall system performance. To provide efficient address translation support for accelerator-centric architectures, we examine the memory access behavior of customized accelerators to guide our TLB augmentation and MMU designs. First, to support bulk transfers of consecutive data between an accelerator's scratchpad memory and the memory system, we present a relatively small private TLB design that provides low-latency caching of translations for each accelerator. Second, to compensate for the effects of widely used data-tiling techniques, we design a shared level-two TLB to serve private-TLB misses on common virtual pages, eliminating duplicate page walks from accelerators working on neighboring data tiles that map to the same physical page. This two-level TLB design reduces page walks by 75.8% on average. Finally, instead of implementing a dedicated MMU, which introduces additional hardware complexity, we propose simply leveraging the host per-core MMU for efficient page-walk handling. This mechanism is based on our insight that the existing MMU cache in the CPU MMU can satisfy the demands of customized accelerators with minimal overhead. Our evaluation demonstrates that the combined approach incurs only a 6.4% performance overhead compared to ideal address translation.
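The two-level TLB organization described above can be illustrated with a small behavioral sketch (not the paper's hardware implementation; all class names, sizes, and the identity page-table stub are hypothetical). Each accelerator looks up its private TLB first; on a miss it probes the shared level-two TLB, and only a miss in both levels triggers a page walk. Two accelerators working on neighboring data tiles within the same physical page therefore cost one walk, not two:

```python
from collections import OrderedDict

PAGE_SIZE = 4096

class TLB:
    """A simple LRU translation cache mapping virtual page -> physical page."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, vpn):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)   # refresh LRU position on a hit
            return self.entries[vpn]
        return None

    def insert(self, vpn, ppn):
        self.entries[vpn] = ppn
        self.entries.move_to_end(vpn)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry

class TwoLevelMMU:
    """Private per-accelerator TLBs backed by one shared level-two TLB.

    A page walk is performed only when both TLB levels miss; the walk is
    stubbed here as an identity mapping for illustration.
    """
    def __init__(self, n_accels, l1_size=8, l2_size=64):
        self.private_tlbs = [TLB(l1_size) for _ in range(n_accels)]
        self.shared_tlb = TLB(l2_size)
        self.page_walks = 0

    def translate(self, accel_id, vaddr):
        vpn = vaddr // PAGE_SIZE
        ppn = self.private_tlbs[accel_id].lookup(vpn)
        if ppn is None:
            ppn = self.shared_tlb.lookup(vpn)       # serve private-TLB misses
            if ppn is None:
                self.page_walks += 1                # both levels missed: walk
                ppn = vpn                           # identity page table (stub)
            self.shared_tlb.insert(vpn, ppn)
            self.private_tlbs[accel_id].insert(vpn, ppn)
        return ppn * PAGE_SIZE + vaddr % PAGE_SIZE

# Two accelerators process neighboring 1 KB tiles of the same 4 KB page:
mmu = TwoLevelMMU(n_accels=2)
mmu.translate(0, 0x1000)   # accel 0, tile 0: misses both levels -> one page walk
mmu.translate(1, 0x1400)   # accel 1, tile 1: same page -> shared-TLB hit, no walk
assert mmu.page_walks == 1
```

Under this model, without the shared level, the second accelerator's private miss would have gone straight to a duplicate page walk, which is exactly the redundancy the shared TLB eliminates for tiled workloads.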
