Efficient data supply for hardware accelerators with prefetching and access/execute decoupling

This paper presents an architecture framework to easily design hardware accelerators that can effectively tolerate long and variable memory latency using prefetching and access/execute decoupling. Hardware accelerators are becoming increasingly popular in modern computing systems as a promising approach to achieve higher performance and energy efficiency when technology scaling is slowing down. However, today's high-performance accelerators require significant manual efforts to design, in large part due to the need to carefully orchestrate data transfers between external memory and an accelerator. Instead, the proposed framework utilizes automated program analysis along with High-Level Synthesis (HLS) tools to enable prefetching and access/execute decoupling with minimal manual efforts. The framework adds tags to accelerator memory accesses so that hardware prefetching can effectively preload data for accesses with regular patterns. To handle irregular memory accesses, the framework generates an accelerator with decoupled access/execute architecture using program slicing. Experimental results show that the proposed optimizations can significantly improve performance of HLS-generated accelerators (average speedup of 2.28x across eight accelerators) and often reduce energy consumption (average of 15%).

[1]  Kermin Fleming,et al.  Leap scratchpads: automatic memory and cache management for reconfigurable logic , 2010, FPGA '11.

[2]  Anoop Gupta,et al.  Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.

[3]  Steven Swanson,et al.  Conservation cores: reducing the energy of mature computations , 2010, ASPLOS XV.

[4]  Thomas F. Wenisch,et al.  Thin servers with smart pipes: designing SoC accelerators for memcached , 2013, ISCA.

[5]  John Wawrzynek,et al.  Architectural synthesis of computational pipelines with decoupled memory access , 2014, 2014 International Conference on Field-Programmable Technology (FPT).

[6]  Jeff Mason,et al.  CHiMPS: A C-level compilation flow for hybrid CPU-FPGA architectures , 2008, 2008 International Conference on Field Programmable Logic and Applications.

[7]  Kenneth A. Ross,et al.  Navigating big data with high-throughput, energy-efficient data partitioning , 2013, ISCA.

[8]  Christopher Batten,et al.  PyMTL: A Unified Framework for Vertically Integrated Computer Architecture Research , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[9]  Jean-Loup Baer,et al.  An effective on-chip preloading scheme to reduce data access penalty , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[10]  Karthikeyan Sankaralingam,et al.  Dynamically Specialized Datapaths for energy efficient computing , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[11]  Eric S. Chung,et al.  LINQits: big data on little clients , 2013, ISCA.

[12]  Tao Chen,et al.  Execution time prediction for energy-efficient hardware accelerators , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[13]  Kermin Fleming,et al.  Optimizing under abstraction: Using prefetching to improve FPGA performance , 2013, 2013 23rd International Conference on Field programmable Logic and Applications.

[14]  Zhiru Zhang,et al.  Multithreaded pipeline synthesis for data-parallel kernels , 2014, 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[15]  John Paul Shen,et al.  Speculative precomputation: long-range prefetching of delinquent loads , 2001, Proceedings 28th Annual International Symposium on Computer Architecture.

[16]  Nigel P. Topham,et al.  A comparison of data prefetching on an access decoupled and superscalar machine , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[17]  Gu-Yeon Wei,et al.  MachSuite: Benchmarks for accelerator design and customized architectures , 2014, 2014 IEEE International Symposium on Workload Characterization (IISWC).

[18]  Karthikeyan Sankaralingam,et al.  Efficient execution of memory access phases using dataflow specialization , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[19]  Thomas F. Wenisch,et al.  Spatial Memory Streaming , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[20]  James C. Hoe,et al.  CoRAM: an in-fabric memory architecture for FPGA-based computing , 2011, FPGA '11.

[21]  David W. Binkley,et al.  Program slicing , 2008, 2008 Frontiers of Software Maintenance.

[22]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[23]  Guilherme Ottoni,et al.  Automatic thread extraction with decoupled software pipelining , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[24]  Margaret Martonosi,et al.  DeSC: Decoupled supply-compute communication management for heterogeneous architectures , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[25]  Onur Mutlu,et al.  Runahead execution: an alternative to very large instruction windows for out-of-order processors , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[26]  Gu-Yeon Wei,et al.  Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[27]  Ninghui Sun,et al.  DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.

[28]  James E. Smith,et al.  Data Cache Prefetching Using a Global History Buffer , 2005, IEEE Micro.

[29]  Zhiru Zhang,et al.  Flushing-enabled loop pipelining for high-level synthesis , 2014, 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC).

[30]  Ralph Wittig,et al.  Performance and power of cache-based reconfigurable computing , 2009, FPGA '09.

[31]  Feng Liu,et al.  CGPA: Coarse-Grained Pipelined Accelerators , 2014, 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC).