An Optimal Microarchitecture for Stencil Computation Acceleration Based on Nonuniform Partitioning of Data Reuse Buffers

High-level synthesis (HLS) tools have made significant progress in compiling high-level descriptions of computation into highly pipelined register-transfer level specifications. The high-throughput computation raises a high data demand. To prevent data accesses from being the bottleneck, on-chip memories are used as data reuse buffers to reduce off-chip accesses. Also memory partitioning is explored to increase the memory bandwidth by scheduling multiple simultaneous memory accesses to different memory banks. Prior work on memory partitioning of data reuse buffers is limited to uniform partitioning. In this paper, we perform an early-stage exploration of nonuniform memory partitioning. We use the stencil computation, a popular communication-intensive application domain, as a case study to show the potential benefits of nonuniform memory partitioning. Our novel method can always achieve the minimum memory size and the minimum number of memory banks, which cannot be guaranteed in any prior work. We develop a generalized microarchitecture to decouple stencil accesses from computation, and an automated design flow to integrate our microarchitecture with the HLS-generated computation kernel for a complete accelerator.

[1]  Jason Cong,et al.  High-Level Synthesis for FPGAs: From Prototyping to Deployment , 2011, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[2]  Jason Cong,et al.  Memory partitioning and scheduling co-optimization in behavioral synthesis , 2012, 2012 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[3]  Todor Stefanov,et al.  pn: A Tool for Improved Derivation of Process Networks , 2007, EURASIP J. Embed. Syst..

[4]  Paul Feautrier,et al.  Some efficient solutions to the affine scheduling problem. I. One-dimensional time , 1992, International Journal of Parallel Programming.

[5]  Jason Cong,et al.  Polyhedral-based data reuse optimization for configurable computing , 2013, FPGA '13.

[6]  Marc Snir,et al.  GETTING UP TO SPEED THE FUTURE OF SUPERCOMPUTING , 2004 .

[7]  David Atienza,et al.  A high-level synthesis flow for the implementation of iterative stencil loop algorithms on FPGA devices , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).

[8]  Samuel H. Fuller,et al.  The Future of Computing Performance: Game Over or Next Level? , 2014 .

[9]  Jason Cong,et al.  Optimizing memory hierarchy allocation with loop transformations for high-level synthesis , 2012, DAC Design Automation Conference 2012.

[10]  Jason Cong,et al.  An Optimal Microarchitecture for Stencil Computation Acceleration Based on Nonuniform Partitioning of Data Reuse Buffers , 2014, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[11]  Jason Cong,et al.  CMOST: A system-level FPGA compilation framework , 2015, 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[12]  Jason Cong,et al.  Automatic memory partitioning and scheduling for throughput and power optimization , 1999, 2009 IEEE/ACM International Conference on Computer-Aided Design - Digest of Technical Papers.

[13]  Yosi Ben-Asher,et al.  Automatic Memory Partitioning: Increasing memory parallelism via data structure partitioning , 2010, 2010 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[14]  Jason Cong,et al.  Customizable Domain-Specific Computing , 2009, IEEE Design & Test of Computers.

[15]  FeautrierPaul Some efficient solutions to the affine scheduling problem , 1992 .

[16]  Jason Cong,et al.  Accelerator-rich CMPs: From concept to real hardware , 2013, 2013 IEEE 31st International Conference on Computer Design (ICCD).

[17]  Jason Cong,et al.  Memory partitioning for multidimensional arrays in high-level synthesis , 2013, 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC).

[18]  J. Ramanujam,et al.  Compile-Time Techniques for Data Distribution in Distributed Memory Machines , 1991, IEEE Trans. Parallel Distributed Syst..

[19]  Mark Horowitz,et al.  1.1 Computing's energy problem (and what we can do about it) , 2014, 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC).

[20]  Jason Cong,et al.  An integrated and automated memory optimization flow for FPGA behavioral synthesis , 2012, 17th Asia and South Pacific Design Automation Conference.

[21]  Jason Cong,et al.  Improving high level synthesis optimization opportunity through polyhedral transformations , 2013, FPGA '13.