Automatic cache partitioning method for high-level synthesis

Abstract High-Level Synthesis (HLS) can automatically translate existing software algorithms into hardware, allowing quick prototyping or deployment of embedded designs. High-level software is written with a single main memory in mind, whereas hardware designs can exploit many parallel memories, so translating and optimizing memory usage, and generating the resulting memory architectures, is critical for high-performance designs. HLS tools provide memory optimizations targeting data reuse and partitioning, but these are generally applied separately to a given object in memory. Memory accesses that cannot be optimized effectively are serialized to main memory, hindering any further parallelization of the surrounding generated hardware. In this work, we present an automated optimization method that creates custom cache memory architectures for HLS-generated designs. The optimization uses runtime profiling data and is performed at a localized scope, combining data-reuse savings with memory partitioning to increase the available parallelism and alleviate serialized memory accesses, improving performance. We compare against architectures without this optimization and against other HLS caching approaches. Results show that designs produced by this method require only 72% of the execution cycles of a single-cache design, and 31% of those of a design with no caches.
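To illustrate the intuition behind partitioned caches, the following is a minimal sketch, not the paper's actual algorithm: a profiled address trace is split by the memory region (array) each access touches, and each region receives its own small cache instead of sharing one cache of the same total capacity. All region names, sizes, and the access pattern are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's method): partition a profiled
# access trace by memory region and give each region its own small cache.

class DirectMappedCache:
    """Tiny direct-mapped cache model that only tracks tags and hit/miss counts."""
    def __init__(self, num_lines, line_words=4):
        self.num_lines = num_lines
        self.line_words = line_words
        self.tags = [None] * num_lines  # one tag slot per cache line
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        line_addr = addr // self.line_words
        idx = line_addr % self.num_lines
        tag = line_addr // self.num_lines
        if self.tags[idx] == tag:
            self.hits += 1
        else:
            self.misses += 1
            self.tags[idx] = tag  # fill the line on a miss

def region_of(addr, regions):
    """Map an address to the profiled region (array) that contains it."""
    for name, (base, size) in regions.items():
        if base <= addr < base + size:
            return name
    return "other"

# Hypothetical profile: two arrays streamed in an interleaved pattern,
# a common source of conflict misses in a single shared cache.
regions = {"A": (0, 256), "B": (1024, 256)}
trace = [a for i in range(256) for a in (i, 1024 + i)]

# One shared cache vs. one small cache per region (same total capacity).
shared = DirectMappedCache(num_lines=16)
for addr in trace:
    shared.access(addr)

per_region = {name: DirectMappedCache(num_lines=8) for name in regions}
for addr in trace:
    per_region[region_of(addr, regions)].access(addr)

part_misses = sum(c.misses for c in per_region.values())
print(f"shared cache misses:       {shared.misses}")   # → 512 (every access conflicts)
print(f"partitioned caches misses: {part_misses}")     # → 128 (compulsory misses only)
```

Because both arrays map to the same cache lines in the shared configuration, the interleaved accesses evict each other on every reference; partitioning removes the conflicts entirely, leaving only the compulsory miss per cache line. Real designs would also exploit the independent caches for parallel accesses, which this sequential model does not capture.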
